A Novel Classification Extension-Based Cloud Detection Method for Medium-Resolution Optical Images

Abstract: Accurate cloud detection using medium-resolution multispectral satellite imagery (such as Landsat and Sentinel data) is always difficult due to the complex land surfaces, diverse cloud types, and limited number of available spectral bands, especially in the case of images without thermal bands. In this paper, a novel classification extension-based cloud detection (CECD) method was proposed for masking clouds in medium-resolution images. The new method does not rely on thermal bands and can be used for masking clouds in different types of medium-resolution satellite imagery. First, with the support of low-resolution satellite imagery with short revisit periods, cloud and non-cloud pixels were identified in the resampled low-resolution version of the medium-resolution cloudy image. Then, based on the identified cloud and non-cloud pixels and the resampled cloudy image, training samples were automatically collected to develop a random forest (RF) classifier. Finally, the developed RF classifier was extended to the corresponding medium-resolution cloudy image to generate an accurate cloud mask. The CECD method was applied to Landsat-8 and Sentinel-2 imagery to test its performance on different satellite images, and the well-known function of mask (FMASK) method was employed for comparison. The results indicate that CECD is more accurate at detecting clouds in Landsat-8 and Sentinel-2 imagery, giving average F-measure values of 97.65% and 97.11% for Landsat-8 and Sentinel-2 imagery, respectively, as against corresponding results of 90.80% and 88.47% for FMASK. It is concluded, therefore, that the proposed CECD algorithm is an effective cloud-classification algorithm that can be applied to medium-resolution optical satellite imagery.


Introduction
Optical remote sensing data ranging from the visible to the shortwave infrared are widely used for surface cover mapping, surface parameter estimation, and ecosystem monitoring [1][2][3][4]. Nowadays, the increasing number of medium-resolution optical sensors provides the opportunity for finer monitoring of the global environment [5][6][7][8]. Nevertheless, the images from these sensors are usually contaminated by clouds. This contamination obscures the ground-surface reflectance and reduces the accuracy of optical images in various applications [9,10]. Therefore, accurately identifying and masking clouds is a preprocessing step that should be carried out before using optical imagery.

Study Area
Twelve test sites (Figure 1), selected from areas where the problem of confusion between clouds and background surfaces is most severe, were chosen to evaluate the performance of the CECD for cloud detection. These sites were spread over six continents and covered a wide range of surface environments. For each test site, two scenes were randomly selected to test the performance of the CECD for Landsat-8 and Sentinel-2 data, respectively. These scenes included almost all possible seasons and surface-cover types, from desert to ice sheet. Details of the 24 test scenes are summarized in Table 1. (Note to Figure 1 [39]: L8_Biome Test Scenes refers to the Landsat-8 test images selected from the Landsat 8 Cloud Cover Assessment Validation Dataset.)

Datasets and Preprocessing
Due to its superiority in terms of spatial resolution (300 m) and revisit period (1-2 days) [40], Proba-V imagery was selected to generate cloud-free images. Proba-V was launched in 2013 and has four optical spectral channels with wavelengths from 0.440 to 1.635 µm (Table 2) [41]. Level 2A top-of-atmosphere (TOA) reflectance images were used in this study [40]. The clouds and shadows in the Proba-V images were masked based on the Proba-V quality band. Then, Proba-V images from within ±15 days of the target date were used to produce a composite cloud-free image. The median composition method was used to generate the cloud-free image [42,43]: the median of all unmasked observations for each pixel was selected as the composite value.

Table 2. The spectral bands of Landsat-8 and Sentinel-2 data used in this paper. The overlapping bands used for collecting training samples (Section 3.1) are highlighted in bold letters.

Two types of widely used multispectral satellite data (Landsat-8, Sentinel-2) were selected to test the performance of the CECD. These data each have four spectral bands that overlap with those of the Proba-V satellite. Since most multispectral data do not include thermal infrared bands, and Joshi et al. [22] proved that clouds can be effectively detected without thermal bands, CECD uses only the TOA reflectance bands ranging from the visible to the short-wave infrared as inputs for each type of satellite image (Table 2). Because all the spectral bands in Landsat-8 data have a 30-m spatial resolution, the CECD generates 30-m cloud masks for Landsat-8 data. The Sentinel-2 bands have three different spatial resolutions (10, 20, and 60 m); in order to generate a uniform-resolution cloud mask for Sentinel-2 imagery, the bands with 10- and 60-m spatial resolutions were resampled to 20 m, so the cloud masks generated for Sentinel-2 images have a resolution of 20 m.
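The median compositing step described above can be sketched as follows; this is a minimal illustration that assumes the ±15-day Proba-V observations are stacked into a (time, height, width) array in which quality-masked (cloud/shadow) pixels are set to NaN:

```python
import numpy as np

def median_composite(stack):
    """Per-pixel median over a time stack (T, H, W); masked (cloud/shadow)
    observations are NaN and are ignored. Pixels never observed stay NaN."""
    return np.nanmedian(stack, axis=0)

# Toy stack: 3 dates, 2x2 pixels; NaN marks observations masked by the quality band.
stack = np.array([
    [[0.10, np.nan], [0.30, 0.50]],
    [[0.20, 0.40],   [np.nan, 0.60]],
    [[0.30, 0.60],   [0.40, 0.70]],
])
composite = median_composite(stack)
```

Each composite pixel is the median of its valid observations only, so isolated residual cloud or shadow values have little influence on the result.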

Validation Dataset
For each test scene, 800 samples were randomly selected. Because we focused only on the accuracy of the cloud detection, the samples were carefully labeled as belonging to one of two categories (cloud or clear) by checking the original satellite images. However, it is sometimes difficult to accurately interpret samples located in very thin clouds [14]. To ensure the reliability of each validation sample, the difficult-to-interpret samples were excluded from the evaluation process, and samples located at the intersection of clouds and land surfaces were moved to the interior of the relevant clouds. As a consequence, a total of 9362 Landsat-8 samples (3799 cloud samples, 5563 cloud-free samples) and 9341 Sentinel-2 samples (4085 cloud samples, 5256 cloud-free samples) were manually interpreted. To reduce labeling error [28], all the validation samples were collected and labeled by a single scientist.
In order to objectively evaluate the robustness of our algorithm, we also selected public cloud validation data from the Landsat 8 Cloud Cover Assessment Validation Dataset (L8_Biome) (https://landsat.usgs.gov/landsat-8-cloud-cover-assessment-validation-data) for independent testing. The L8_Biome dataset was developed by Foga et al. [44], and each scene in it has a manually labeled cloud mask. However, due to the low quality of some masks in L8_Biome, we did not use all of its validation scenes. Since Li et al. [23] had screened out 19 high-quality validation scenes from the L8_Biome dataset through careful visual inspection, we chose the same scenes. In addition, because Proba-V images were required in our calculation process and were only available after about November 2013 [41], the scenes acquired before November 2013 were removed from the 19 validation scenes. Finally, a total of 14 L8_Biome scenes were selected for independent testing (Figure 1). The details of these scenes are listed in Table S1. Since there was no public cloud validation dataset for Sentinel-2 imagery, the independent validation was only carried out on Landsat-8 images.

Method
The CECD method is designed for masking clouds in images acquired by different medium-resolution optical sensors, and does not require thermal infrared bands. The method consists of three main parts: first, the medium-resolution cloudy image is resampled to a low-resolution one and, together with a composited low-resolution cloud-free image, is used to identify cloud and non-cloud pixels. Secondly, using the identified cloud and non-cloud pixels and the resampled cloudy image, training samples are automatically collected to develop a local adaptive classification model at the low-resolution level for each scene. Finally, the trained classifier is extended to mask clouds in the corresponding medium-resolution image. In our particular case, the composited low-resolution cloud-free image was derived from Proba-V data, and the medium-resolution cloudy images were Landsat-8 and Sentinel-2 images. A flowchart of the CECD method is illustrated in Figure 2.

Identifying Cloud and Non-Cloud Pixels with the Support of Reference Cloud-Free Imagery
Our aim is to collect training samples at a low-resolution level. However, distinguishing between bright surfaces and clouds and detecting thin clouds are the two main obstacles in identifying cloud and non-cloud pixels [12,18]. In order to collect accurate training samples, the medium-resolution cloudy Landsat-8 and Sentinel-2 images were aggregated to the Proba-V resolution and used together with the composited cloud-free Proba-V images (see Section 2.2) to address these problems.
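The aggregation of the medium-resolution cloudy image to the Proba-V grid can be sketched as a simple block average; this is an illustrative assumption, as the exact resampling kernel is not specified here:

```python
import numpy as np

def aggregate(band, factor):
    """Aggregate a 2-D reflectance band by averaging non-overlapping
    factor x factor blocks (e.g. factor = 10 for 30 m -> 300 m)."""
    h, w = band.shape
    assert h % factor == 0 and w % factor == 0, "dimensions must divide evenly"
    return band.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

band = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "medium-resolution" band
coarse = aggregate(band, 2)                      # 2x2 aggregated version
```

Real scenes whose dimensions are not exact multiples of the factor would need cropping or padding before this reshape trick applies.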
First, as bright surfaces can be well separated from clouds with the help of a corresponding cloud-free image [16,17], the composited cloud-free Proba-V images and the resampled Landsat-8 and Sentinel-2 cloudy images were used to enhance the difference between clouds and bright surfaces. A temporal haze optimized transformation (THOT) method, which is not sensitive to the errors caused by inconsistencies in reflectance between cloud-free and cloudy images [16], was used for this (Equation (1)):

THOT = Σ_{i=1}^{n} k_i ∆R_i + c (1)

where ∆R_i is the difference in reflectance between the cloudy and cloud-free images for an overlapping band i for a given pixel (the bands used to calculate ∆R in this study are highlighted in bold in Table 2); n is the number of overlapping bands; and k_i and c are coefficients determined by a multivariate regression between the haze optimized transformation (HOT) and the ∆R_i:

HOT = Σ_{i=1}^{n} k_i ∆R_i + c + ε, with HOT = R_blue · sin θ − R_red · cos θ (2)

where ε is the regression residual, and R_blue and R_red are the TOA reflectances of the blue and red bands in the cloudy image, respectively. θ is the angle of the "clear line" of the corresponding cloud-free image, and the "clear line" can be fitted by a linear regression of the blue and red bands in the cloud-free image [45]. Since ∆R_i for non-cloud pixels is close to zero and smaller than for cloudy ones, the THOT index, which is the linear sum of the ∆R_i (Equation (1)), can separate clouds from bright surfaces well [16].

Secondly, since the combination of the fuzzy C-means (FCM) algorithm and local information can fully compensate for the poor spectral contrast of mixed pixels [46], a modified fuzzy C-means (MFCM) method [47] was applied to the generated THOT image to further enhance the contrast between thin clouds and adjacent non-cloud pixels (Equation (3)). By minimizing the objective function, J, a corrected THOT value, x, was derived (the complete calculation process is presented in Appendix A):

J = Σ_{i=1}^{C} Σ_{j=1}^{N} µ_ij^m (x_j − v_i)² + a Σ_{i=1}^{C} Σ_{j=1}^{N} µ_ij^m (x̄_j − v_i)² (3)
where µ_ij^m is the probability that a pixel j belongs to cluster i; m is a weighting exponent, for which a default value of 2 was used; x_j is the corrected THOT value of pixel j obtained by subtracting the gain field, β_j; N is the total number of pixels; C is the number of clusters, which was set to 2 (cloud and non-cloud); v_i is the prototype of the centroid for cluster i; x̄_j is the mean value of the pixels within a 3 × 3 window around pixel j; and a controls the neighboring effect, for which a default value of 0.3 was used. Since Yang et al. [47] conducted comprehensive tests on these parameters, and the default values of all parameters were the optimal values they derived, all the parameters in MFCM were set to the default settings. An example of the application of this method to a HOT image histogram is shown in Figure 3 below. It polarizes the image reflectance distribution of clouds and land to generate a dual-mode histogram (Figure 3e), which results in an increase in the contrast between thin clouds and neighboring surfaces in the corrected image (Figure 3c).
Finally, as the contrast between surfaces and clouds was effectively enhanced, accurate cloud and non-cloud pixels could be identified simply by applying an appropriate threshold. OTSU, a simple automatic thresholding method proposed by Otsu [48], was therefore applied to the corrected THOT image to identify cloud and non-cloud pixels.


Collection of Training Samples
The training samples were automatically collected using the identified cloud and non-cloud pixels (see Section 3.1) and the blue band of the resampled low-resolution Landsat-8 and Sentinel-2 cloudy image. As the reliability and representativeness of the training samples directly affect the accuracy of classification, a bin-based sample selecting approach proposed by Zhu et al. [1] was employed to search for representative cloud and non-cloud samples with different brightnesses. We first divided the 0-1.0 range of the blue reflectance values in the resampled cloudy image into 5 bins with equal intervals. Then, each bin was divided into cloud and non-cloud sub-bins based on the identified cloud and non-cloud pixels, and the erosion operator [49] was used to remove the "salt-and-pepper" pixels in each sub-bin. Finally, training samples of clouds and non-cloud surfaces were derived from the remaining pixels in each sub-bin. In order to optimize the distribution of the samples, samples were collected from sub-bins in proportion to the area occupied by each sub-bin [50,51].
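The bin-based selection can be sketched as below; the erosion of "salt-and-pepper" pixels is omitted for brevity, and the toy inputs are hypothetical:

```python
import numpy as np

def bin_based_samples(blue, cloud_mask, total=1000, nbins=5, rng=None):
    """Pick training pixel indices: nbins equal-width blue-reflectance bins on
    [0, 1], each split into cloud / non-cloud sub-bins; each sub-bin contributes
    samples in proportion to its pixel count (erosion step omitted here)."""
    rng = rng or np.random.default_rng(0)
    blue = blue.ravel()
    labels = cloud_mask.ravel().astype(bool)
    bins = np.clip((blue * nbins).astype(int), 0, nbins - 1)
    n = blue.size
    picks = []
    for b in range(nbins):
        for is_cloud in (False, True):
            idx = np.where((bins == b) & (labels == is_cloud))[0]
            if idx.size == 0:
                continue
            k = max(1, round(total * idx.size / n))   # proportional allocation
            picks.append(rng.choice(idx, size=min(k, idx.size), replace=False))
    return np.concatenate(picks)

blue = np.array([[0.05, 0.15], [0.55, 0.95]])   # toy blue reflectance
mask = np.array([[0, 0], [1, 1]])               # identified cloud pixels
samples = bin_based_samples(blue, mask, total=4)
```

Drawing from every populated sub-bin keeps both dark and bright examples of each class in the training set, which is the point of the bin-based strategy.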

Modeling of Random Forest Classifier
As a local adaptive classification model can achieve a higher classification accuracy than a single global model [52], a local adaptive model was developed for each scene based on the training samples collected from that scene (Section 3.2.1). First, the spectral data of the training samples in the resampled low-resolution Landsat-8 and Sentinel-2 image were extracted to build the training data. Due to the complex spectral characteristics of clouds and surface objects, some objects can exhibit similar spectral behavior to clouds in certain spectral bands [32], and it is a challenge to accurately detect clouds using a limited number of spectral bands [23]. Therefore, in order to make full use of the spectral information, all Landsat-8 and Sentinel-2 bands mentioned in Table 2 were used. Then, a local adaptive classification model was trained using the training data derived from the low-resolution Landsat-8 and Sentinel-2 image. It should be noted that, since our method is a local adaptive classification method, it is not sensitive to radiometric and reflectance calibration [53]; therefore, both top-of-atmosphere and surface reflectance data can be used for model training. Finally, the trained classifier was extended to the original medium-resolution Landsat-8 and Sentinel-2 image to classify cloudy pixels.
According to previous investigations [37,52], the RF classifier is capable of processing highly dimensional, multicollinear data and is rarely affected by noise or feature selection. In addition, it has proven to be more accurate and efficient than other widely used classifiers such as the support vector machine (SVM), artificial neural network (ANN), and classification and regression tree (CART) classifiers [54][55][56][57][58]. Considering these advantages, the RF classifier was employed in this study. The RF classifier has two parameters: the number of classification trees (Ntree) and the number of selected prediction features (Mtry). Since many researchers have demonstrated that the classification accuracy is largely insensitive to these two parameters [54], a value of 100 was used for Ntree and the square root of the total number of training features for Mtry.
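In scikit-learn, the stated configuration (Ntree = 100, Mtry = square root of the number of features) corresponds to n_estimators=100 and max_features="sqrt". A sketch on synthetic placeholder spectra, not real training data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Ntree = 100 trees, Mtry = sqrt(number of features), as in the text.
rng = np.random.default_rng(42)
n_bands = 8                                        # e.g. the bands listed in Table 2
clear = rng.normal(0.15, 0.05, (300, n_bands))     # synthetic "clear" spectra
cloud = rng.normal(0.60, 0.05, (300, n_bands))     # synthetic "cloud" spectra (brighter)
X = np.vstack([clear, cloud])
y = np.array([0] * 300 + [1] * 300)                # 0 = clear, 1 = cloud

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
pred = rf.predict(rng.normal(0.60, 0.05, (10, n_bands)))  # new cloud-like pixels
```

In the CECD workflow, the model fitted on the low-resolution samples is then applied unchanged to the per-pixel spectra of the original medium-resolution image.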

Cloud Masking in Landsat Imagery
The cloud-detection results for the 12 Landsat-8 test scenes are illustrated in Figure 4: it can be seen that there is significant agreement between the CECD cloud cover and the actual cloud distribution. All kinds of cloud-contaminated pixels, including thick and thin clouds, are accurately separated from snow, bright sand, bright impervious surfaces, and bright rocks, surfaces that are usually confused with clouds by traditional methods. Therefore, the proposed CECD can effectively extract clouds and mitigate the disturbances caused by other bright land surfaces with similar spectral reflectance. For example, a large proportion of snow, bright bare lands, and bright impervious surfaces are shown in Figure 4b-e,g,h,k, but there is no overestimation of clouds in the corresponding CECD results. Moreover, the thin clouds over different land-cover types are also accurately captured in the detection results (Figure 4a,c,f,j,l). It can be observed that the details of the cloud boundaries and internal structures are well described in our cloud-mask maps, which further reveals the effectiveness of the proposed CECD. (Figure 4 caption, partial: see Table 1; the upper row is the color-composited cloudy images, and the lower row is the corresponding CECD cloud masks, in orange.)


Cloud Masking in Sentinel-2 Imagery
The second experiment was conducted on the 12 Sentinel-2 test images, and the corresponding cloud-detection results are given in Figure 5. Unlike for Landsat-8 imagery, due to the lack of thermal bands, the separation between clouds and other bright surfaces is always a big problem when using conventional automated cloud-detection methods for Sentinel-2 images [12,14]. However, this issue can be greatly alleviated by using CECD (Figure 5). Both thick and thin clouds are well separated from the bright surfaces in the results shown. For instance, the scenes in Figure 5b,d,e,h-k are very complex cases containing snow, bright bare lands, and bright rocks. The cloud pixels are also well identified in the corresponding CECD cloud masks. In addition, the low-altitude thin clouds that are usually underestimated by traditional cloud-detection methods are also effectively detected by the CECD cloud masks (Figure 5a,d,f-j,l). Therefore, it can be said that the CECD method can also achieve a reasonably good performance when applied to Sentinel-2 images without using the thermal infrared bands. (Figure 5 caption, partial: see Table 1; the upper row is the color-composited cloudy images, and the lower row is the corresponding CECD cloud masks, in orange.)


Comparison with FMASK Cloud-Detection Algorithm
The CECD was compared with the function of mask (FMASK) algorithm for the 24 Landsat-8 and Sentinel-2 test scenes (see Section 2.1). FMASK is a cloud-mask algorithm that was designed for masking clouds in Landsat and Sentinel-2 imagery, and it was the operational method used for Landsat data products [11]. The latest version, FMASK 4.0 [14] (https://github.com/gersl/fmask), which is the state-of-the-art FMASK cloud-detection technique, was used for comparison with the CECD. In order to ensure the objectivity of the comparison, the dilation parameter in FMASK was set to 0 to be consistent with the CECD. Apart from this, because the default values of all other parameters in FMASK are the optimal thresholds obtained by sensitivity analysis using images from different environments around the world [14], and these default thresholds are applicable to different types of clouds and different regions, all the other parameters in FMASK were set to the default settings described in Qiu et al. (2019).
To quantitatively evaluate the accuracy of the results, three traditional metrics [59] including the user accuracy (U.A.), producer accuracy (P.A.), and kappa coefficient (K.C.) were calculated using the validation samples independently selected from each test scene (Section 2.3). In addition, the F-measure (F.M.), which is a single class-specific accuracy metric and the complement of the commission, omission, and overall error, was also calculated using a balanced weighting (β = 1) [22,60,61]. Table 3 summarizes the quantitative accuracy results, and Figure 6 illustrates a comparison between the cloud-detection results produced by CECD and FMASK for three typical Landsat-8 scenes and three Sentinel-2 scenes. As shown in Table 3 and Figure 6, for both algorithms, the overall accuracy was high for most of the test scenes. However, CECD produced a more robust performance than FMASK in the case of both Landsat-8 and Sentinel-2 images. The average F-measure value for CECD was 97.65% and 97.11% for Landsat-8 and Sentinel-2, respectively. In comparison, FMASK gave an average F-measure value of 90.80% and 88.47% for the Landsat-8 and Sentinel-2 imagery, respectively. FMASK did not perform as well with the Sentinel-2 dataset as it did with the Landsat-8 imagery. For instance, it can be observed that FMASK omits many low-altitude clouds in the corresponding Sentinel-2 cloud mask (Figure 6f), while even the very thin clouds are well depicted in the Landsat-8 classification results (Figure 6e). Without thermal bands, it is problematic for FMASK to detect low-altitude thin clouds. Moreover, for scenes containing a large amount of snow/ice and bright impervious surfaces (Landsat-8 scenes at site b, c, h, k, and l; Sentinel-2 scenes at site b, c, e, h, i, and k), the accuracy of FMASK is much less than that of CECD (average F-measure value of CECD is 97.65% and 96.19% as against 83.59% and 81.57% for FMASK for the Landsat-8 and Sentinel-2 images, respectively). 
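The four metrics can be computed from the 2 × 2 confusion counts of the cloud class, as sketched below (the counts are illustrative, not values from the paper):

```python
def cloud_metrics(tp, fp, fn, tn):
    """User's accuracy (precision), producer's accuracy (recall),
    F-measure (beta = 1), and Cohen's kappa from confusion counts
    for the cloud class."""
    ua = tp / (tp + fp)                       # user's accuracy = precision
    pa = tp / (tp + fn)                       # producer's accuracy = recall
    fm = 2 * ua * pa / (ua + pa)              # F-measure with balanced beta = 1
    n = tp + fp + fn + tn
    po = (tp + tn) / n                        # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2  # chance agreement
    kappa = (po - pe) / (1 - pe)
    return ua, pa, fm, kappa

ua, pa, fm, kappa = cloud_metrics(tp=90, fp=10, fn=10, tn=90)
```

With β = 1, the F-measure is the harmonic mean of the user's and producer's accuracies, so it penalizes both commission and omission errors symmetrically.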
As clearly illustrated by Figure 6a-d, there is noticeable misclassification by the FMASK method in these scenes. A large number of bright surfaces are mistakenly labeled as clouds in the FMASK results, whereas the CECD performs well for these scenes. To comprehensively evaluate the performance of FMASK and CECD in relation to specific surface cover types, the pixels in the 24 test scenes (Table 1) were visually classified as vegetation, bare land, impervious, snow/ice, or water classes based on their spectral characteristics. The average cloud-detection accuracies of the two algorithms for these surface covers were calculated and are shown in Figure 7. The robust non-parametric McNemar's statistics (chi-square (χ²) and 95% confidence interval probability) [62] were calculated to evaluate the statistical significance of differences (refer to Table 4).
Note that only when p < 0.05 and χ² > 3.841 does the result indicate a statistically significant difference, and all these accuracy metrics were derived using a sample set aggregated from the 24 selected test scenes (Table 1). The two sets of results show the highest statistically significant differences in the snow/ice areas, followed by the impervious regions. As for barren lands, the cloud detection of the two methods was significantly different in the Sentinel-2 images, but there was no noticeable advantage at the chi-square and 95% significance levels in the Landsat-8 scenes. Moreover, the performance of CECD was comparable to that of FMASK in the water and vegetation areas. These differences were also notable in the accuracy assessment of Figure 7. Compared with FMASK, the CECD results show a significant improvement for the snow/ice-covered regions in the Landsat-8 (F-measure of 96% against 88%) and Sentinel-2 images (F-measure of 95% against 82%), and the impervious areas were also improved (F-measure of 96% against 90% for Landsat-8, and 97% against 85% for Sentinel-2). In addition, the improvement of CECD over FMASK for the Sentinel-2 images was greater than that for the Landsat-8 images in the barren land surfaces. Furthermore, both FMASK and CECD produced good results for areas of both vegetation and water.
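McNemar's chi-square statistic depends only on the discordant counts, i.e., the validation samples that exactly one of the two methods classifies correctly. A sketch without continuity correction, using illustrative counts:

```python
def mcnemar_chi2(b, c):
    """McNemar's chi-square from the discordant counts: b = samples only
    method A got right, c = samples only method B got right. Compared
    against the critical value 3.841 (df = 1, alpha = 0.05)."""
    return (b - c) ** 2 / (b + c)

chi2 = mcnemar_chi2(b=15, c=45)   # hypothetical discordant counts
significant = chi2 > 3.841        # df = 1, 95% confidence level
```

Samples that both methods classify the same way (both right or both wrong) cancel out of the statistic, which is why the test is well suited to comparing two classifiers on one validation set.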
An independent validation experiment was performed using the 14 Landsat-8 images selected from the L8_Biome dataset (Figure 1).
All the non-cloud classes (clear and cloud shadow) in the reference mask of each image were merged into the non-cloud class, and the cloud classes (cloud and thin cloud) were aggregated to the cloud class. These processed cloud masks were used as ground truth to compare the robustness of CECD and FMASK (Figure 8). It can be seen from the results that the accuracy of CECD and FMASK are both high in these scenes, and the accuracy of CECD is higher than that of FMASK (F-measure value of 96.95% against 93.45%). Moreover, the standard deviation error of CECD on each accuracy metric is less than that of FMASK. Therefore, according to all results in Figures 6-8, and the accuracy assessment in Tables 3 and 4, it can be concluded that our CECD method is a robust cloud detection method.
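The per-landscape significance testing described above (p < 0.05, χ² > 3.841, i.e., one degree of freedom at the 95% level) is consistent with McNemar's chi-square test for paired classifiers. The exact variant used in the paper is not stated, so the following is a plausible reconstruction on toy agreement data, with the `mcnemar_chi2` helper being an illustrative name.

```python
import numpy as np

def mcnemar_chi2(correct_a, correct_b):
    """McNemar's chi-square statistic (df = 1, continuity-corrected)
    for comparing two classifiers on the same validation pixels.
    correct_a / correct_b are boolean arrays: True where each method
    labelled the pixel correctly."""
    b = np.sum(correct_a & ~correct_b)   # A right, B wrong
    c = np.sum(~correct_a & correct_b)   # A wrong, B right
    # A value > 3.841 is significant at the 95% level for df = 1.
    return (abs(b - c) - 1) ** 2 / (b + c)

# Toy example: out of 1000 pixels, method A is wrong on 10 where B is
# right, and B is wrong on 30 where A is right.
a = np.ones(1000, dtype=bool)
b_arr = np.ones(1000, dtype=bool)
a[:10] = False          # A wrong, B right on these 10 pixels
b_arr[10:40] = False    # B wrong, A right on these 30 pixels
chi2 = mcnemar_chi2(a, b_arr)
print(chi2, chi2 > 3.841)   # → 9.025 True
```

With b = 30 and c = 10, the statistic is (|30 − 10| − 1)² / 40 = 9.025, exceeding the 3.841 threshold, so the two methods would be judged significantly different on this toy set.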

Effectiveness of the Classification Extension Strategy in Cloud Detection
Zhang et al. [36] proposed a classification extension strategy for land cover mapping. They developed local adaptive classifiers using the 500 m resolution MODIS surface reflectance dataset and extended them to classify the corresponding 30 m Landsat surface reflectance images. This strategy is able to overcome the difficulty of collecting training samples in large-scale land cover mapping [36]. However, it was based on the assumption that the two different remote sensing datasets have highly consistent spectral reflectance. In this study, the training spectral data were derived from the resampled Landsat-8 and Sentinel-2 images, and the target datasets for model extension were the original medium-resolution Landsat-8 and Sentinel-2 datasets themselves. Therefore, except for the errors caused by resampling, the spectral signatures used for training were basically the same as those of the target datasets. All the selected Landsat-8 images (Table 1) were used as a case to illustrate the consistency between the original medium-resolution TOA reflectance and the resampled 300 m low-resolution TOA reflectance (Figure 9). As shown in Figure 9, all eight bands of the original Landsat-8 OLI data were in good agreement with the resampled ones, and the average coefficient of determination (R²) and root mean square error (RMSE) of all bands were higher than 0.88 and lower than 0.053, respectively. Similarly, Li et al. [23] found that a model trained at a resolution 10 times lower than the spatial resolution of the target image could still be used effectively to detect clouds in the original target image. Therefore, the classification extension-based cloud detection method can be considered a suitable method for classifying cloudy pixels of medium-resolution optical images.
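The band-by-band agreement check behind Figure 9 can be sketched as follows. This assumes mean-aggregation resampling and synthetic reflectance values; the `band_agreement` helper is a hypothetical name, not code from the paper.

```python
import numpy as np

def band_agreement(fine_band, coarse_band, factor=10):
    """Compare a fine-resolution band with its coarse resampled version.
    The fine band is block-averaged by `factor` so the two grids align,
    then R^2 and RMSE are computed over the shared pixels."""
    h = (fine_band.shape[0] // factor) * factor
    w = (fine_band.shape[1] // factor) * factor
    blocks = fine_band[:h, :w].reshape(h // factor, factor, w // factor, factor)
    aggregated = blocks.mean(axis=(1, 3)).ravel()
    coarse = coarse_band[: h // factor, : w // factor].ravel()
    rmse = float(np.sqrt(np.mean((aggregated - coarse) ** 2)))
    ss_res = np.sum((coarse - aggregated) ** 2)
    ss_tot = np.sum((coarse - coarse.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot), rmse

# Synthetic 300x300 "30 m" band and a slightly noisy 30x30 "300 m"
# version of it (the noise stands in for resampling error).
rng = np.random.default_rng(1)
fine = rng.uniform(0.0, 0.6, size=(300, 300))
coarse = fine.reshape(30, 10, 30, 10).mean(axis=(1, 3)) + rng.normal(0, 0.005, (30, 30))
r2, rmse = band_agreement(fine, coarse)
print(f"R^2 = {r2:.3f}, RMSE = {rmse:.4f}")
```

On real imagery, one such comparison per band would yield the per-band R² and RMSE values summarized in Figure 9.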


Reliability of the Collected Training Samples
In this study, we used the THOT and MFCM methods to separate clouds from background surfaces in order to obtain accurate training samples. The THOT method was used to enhance the difference between clouds and bright surfaces, and the MFCM method was used to highlight the features of thin cloud pixels. Because the reliability of training samples is directly related to the accuracy of classification [63], it is also very important to evaluate the accuracy (proportion of correct samples) of the training samples. In general, evaluating the accuracy of all training samples is difficult and time-consuming. Fortunately, since each scene in the L8_Biome dataset has a manually labeled cloud mask, we measured the reliability of the training samples in Landsat-8 imagery by comparing the training samples collected in the 14 selected L8_Biome images with their corresponding cloud masks. We found that these training samples achieved overall accuracies of 96.97% and 93.58% for cloud and non-cloud surfaces in the Landsat-8 scenes, respectively. In addition, since no cloud masks were available for Sentinel-2 imagery, the accuracy of the training samples in the Sentinel-2 images was validated by visual inspection. We randomly selected 200 training samples from each Sentinel-2 test scene (Table 1) and visually checked the correctness of the samples against the original images. The overall accuracies in the Sentinel-2 scenes were 97.6% and 94.5% for cloud and non-cloud samples, respectively. Although the training samples still contained a small number of erroneous points, the random forest model has been demonstrated to be resistant to noise and the presence of erroneous samples [54]. Meanwhile, Gong et al. [64] found that the decrease in overall accuracy of an RF classifier was less than 1% when the error in the training samples was less than 20%. Therefore, the training samples derived in this study were accurate enough for use in detecting cloud pixels.
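The comparison of automatically collected samples against a manually labeled reference mask can be sketched as below. The array shapes and the `sample_accuracy` helper are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sample_accuracy(sample_rows, sample_cols, sample_labels, reference_mask):
    """Fraction of automatically collected training samples whose label
    (True = cloud) matches a manually labelled reference cloud mask."""
    truth = reference_mask[sample_rows, sample_cols]
    return float(np.mean(truth == sample_labels))

# Toy scene: reference mask with clouds in the top half of the image.
ref = np.zeros((100, 100), dtype=bool)
ref[:50, :] = True

# Draw 500 random sample points and give 20 of them the wrong label,
# mimicking a small amount of noise in the automatic collection.
rng = np.random.default_rng(2)
rows = rng.integers(0, 100, size=500)
cols = rng.integers(0, 100, size=500)
labels = ref[rows, cols].copy()
labels[:20] = ~labels[:20]
acc = sample_accuracy(rows, cols, labels, ref)
print(acc)   # 0.96
```

The same per-class computation (restricting the samples to cloud or to non-cloud labels) would yield the class-wise accuracies reported above.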

Computational Efficiency
In order to comprehensively analyze the efficiency of our algorithm, we took the 12 Landsat-8 test scenes (Table 1) as an example and compared the CECD with two other popular cloud detection approaches: the rule-based approach and the operational single classifier-based machine learning approach, which classifies images using a single global classifier (Figure 10). FMASK 4.0 was used as the rule-based example, and the single classifier-based machine learning (hereafter SCML) approach was realized by training a single RF classifier on all validation samples (Section 2.3). Because the FMASK method includes a cloud shadow detection step, the time consumed by cloud shadow detection in FMASK was excluded to ensure the objectivity of the comparison. The hardware environment of our test was an i7-4720HQ CPU at 2.6 GHz with 16 GB of RAM.

Since the efficiency of FMASK depends on the size and complexity of the images [14], the time FMASK spent on different images varied greatly, taking up to 5 min for different scenes. In addition, since our CECD method trains local adaptive classifiers at a low-resolution scale (300 m), the time cost of training was relatively small for each scene (about 5 s), and thus the efficiency of our method was very close to that of using a single global classifier. Therefore, our CECD algorithm can be regarded as suitable for operational use.

The Importance of Input Features for Different Environments
In this paper, we trained the CECD model using all the optical bands of the Landsat-8 and Sentinel-2 imagery (Table 2). However, it is also important to evaluate the need to use multi-spectral information for cloud masking. To quantify the importance of each band for cloud detection in different environments, the 24 test scenes (see Table 1) were visually classified into five landscapes: vegetation, bare land, impervious, snow/ice, and water. Since the RF model can measure the importance of a variable by calculating 'out-of-bag' data for the variable while keeping all the other variables constant [65], this operation was carried out using the RF model, and the results were obtained using all samples from the 24 selected scenes, as shown in Figure 11.
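The band-importance measurement can be sketched with a random forest on synthetic two-class spectra. Note the hedges: the paper measures importance with out-of-bag data, whereas scikit-learn's `feature_importances_` (mean decrease in impurity) is used here as a simpler stand-in, and the band names and data are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic TOA reflectance for 2000 pixels in 4 bands. Clouds are made
# clearly brighter in band 0 and slightly brighter in band 1, so band 0
# should dominate the importance ranking.
rng = np.random.default_rng(3)
n = 2000
cloud = rng.integers(0, 2, size=n).astype(bool)
X = rng.normal(0.15, 0.05, size=(n, 4))
X[cloud, 0] += 0.40    # strong cloud signal in band 0 ("coastal")
X[cloud, 1] += 0.10    # weaker cloud signal in band 1 ("blue")

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, cloud)
for name, imp in zip(["coastal", "blue", "nir", "swir"], rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

Training one such forest per landscape class and comparing the per-band importances is the pattern behind the Figure 11 analysis.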

The results indicate that the coastal and visible bands contributed the most to masking clouds over vegetation and barren land surfaces. Meanwhile, the cirrus band of Landsat-8 and the water vapor and cirrus bands of Sentinel-2 also made a large contribution to cloud detection over barren land surfaces. For cloud detection over impervious regions, the two water vapor absorption bands (the 940 nm water vapor band and the 1380 nm cirrus band) of Sentinel-2 imagery were found to be the most important features, followed by the coastal and blue bands; in comparison, the coastal, visible, and cirrus bands contributed the most to cloud detection over impervious areas in Landsat-8 images. Moreover, when detecting clouds over snow/ice areas, the SWIR bands were shown to be the most important for both Sentinel-2 and Landsat-8 imagery. As for water regions, except for the cirrus band, the importance of the other bands was basically the same in the Landsat-8 imagery, with the NIR and SWIR2 bands making the greatest contribution. Similarly, the contribution of each Sentinel-2 band did not differ greatly when detecting clouds over water, with the NIR and water vapor bands shown to be the most important features. Because the importance of each band differs between environments, it is necessary to make full use of the spectral information when masking clouds in complex landscapes.

Limitations of the Use of CECD for Cloud Detection
Firstly, to produce a composite cloud-free image, multi-temporal low-resolution images are needed. Fortunately, there are already some composited cloud-free low-resolution datasets available that can be used directly, such as MOD09A1. In addition, with the advent of the Google Earth Engine (GEE) platform, the requirements for compositing cloud-free images can easily be met by using the free-access GEE cloud-computation platform [66].
Secondly, although our proposed CECD can easily be adapted for use with different sensors, it may fail in some complex scenes for imagery with only visible and near-infrared reflectance channels, as the SWIR bands are crucial for distinguishing the snow/ice class from clouds [11,30]. Therefore, in the absence of SWIR bands, the CECD may give a reduced accuracy in high-altitude snow-covered regions. Due to the limited number of features that can be used to discriminate between classes and the similar spectral reflectance, snow/ice pixels are often confused with clouds in the visible and near-infrared bands.
Thirdly, although the classifier derived from the resampled low-resolution Landsat-8 and Sentinel-2 images was successfully used to classify clouds in the corresponding 30 m and 20 m images, this classifier may not be suitable for classifying clouds in images with very high spatial resolutions (such as images with a resolution finer than 5 m). Dorji et al. [67] evaluated the impact of different spatial resolutions on observation results and found that the characteristics of an object in an image with a resolution of 2 m are completely different from its characteristics in an image with 1 km resolution. When the difference in spatial resolution is too large, the spectra of the training dataset taken from the resampled low-resolution image will differ from those obtained from the high-resolution cloudy image, resulting in classification errors. Therefore, more investigation should be undertaken to analyze the impact of the resolution difference between the resampled low-resolution imagery and the original high- or medium-resolution cloudy imagery.

Conclusions
Due to the complex surface structures and the variable cloud types, cloud detection is usually a challenging task for optical images, especially when using imagery that does not have thermal bands. Although machine learning-based methods can improve the accuracy of cloud detection, the lack of accurate and representative training datasets is a major drawback hindering the development of these methods.
In this paper, a novel classification extension-based cloud detection (CECD) algorithm was proposed for masking clouds in medium-resolution optical remote sensing imagery. In contrast to other classification-related methods, our CECD method avoids the expense of collecting representative and sufficient training samples by using a classification extension strategy. It first identifies cloud and non-cloud pixels in a resampled low-resolution version of the medium-resolution cloudy image with the support of low-resolution satellite imagery that has a short revisit period. Based on the identified cloud and non-cloud pixels and the resampled low-resolution cloudy image, training samples are then automatically collected to train an RF classification model. Finally, the developed RF classification model is extended to mask clouds in the corresponding medium-resolution image. The CECD method was tested using Landsat-8 and Sentinel-2 images at 12 sites and compared with the classic FMASK method. The validation results show that CECD was more accurate and robust than FMASK, giving average F-measure values of 97.65% and 97.11% as against 90.80% and 88.47% for FMASK using Landsat-8 and Sentinel-2 imagery, respectively. Therefore, it can be concluded that the proposed CECD is an effective cloud-detection algorithm that can be used for masking clouds in multispectral optical satellite images. Furthermore, as CECD is a sample-driven method, it can be applied directly to other types of optical satellite imagery.

Conflicts of Interest:
The authors declare no conflict of interest.
Data availability: Our CECD model in IDL is now available for free download at https://github.com/liuliangyun01/CECD.

Appendix A. Estimation of the Corrected THOT Value by a Minimization of Objective Function
To address the problem of minimizing the objective function (Equation (3)), a Lagrange multiplier can be introduced into the objective function to construct a constrained function O_m,
where λ is the Lagrange multiplier and the memberships satisfy ∑_{i=1}^{C} μ_ij = 1. Then, by setting the derivatives of the Lagrange function with respect to μ_ij, v_i, and β_j to zero, an optimal solution can be obtained, which yields three optimal parameters: the probability that pixel j belongs to the cloud class, the prototype (centroid) of cluster i, and the gain field of pixel j. The cluster prototypes v are updated iteratively until the distance between the centroids of two consecutive iterations is less than 0.001. When the iteration ends, the derived β_j is the desired optimal bias field, and the THOT value can be corrected as THOT_j − β_j. The initial cluster centers v_i are derived
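Since the closed-form MFCM update equations are not reproduced above, the sketch below shows a plain fuzzy c-means iteration on 1-D THOT-like values with the same 0.001 centroid-convergence rule. The paper's MFCM additionally estimates and subtracts the gain field β_j inside this loop, which is omitted here; the `fcm_1d` helper and the initialization are illustrative assumptions.

```python
import numpy as np

def fcm_1d(values, c=2, m=2.0, tol=1e-3, max_iter=100):
    """Plain fuzzy c-means on 1-D data. Returns memberships (n x c)
    and cluster centroids. The MFCM of the appendix adds a per-pixel
    bias (gain) field beta_j to this update loop."""
    x = np.asarray(values, dtype=float)
    v = np.array([x.min(), x.max()])[:c]          # simple centroid init
    for _ in range(max_iter):
        d = np.abs(x[:, None] - v[None, :]) + 1e-12
        # Membership update: mu_ij proportional to d_ij^(-2/(m-1)),
        # normalized so each row sums to 1.
        u = 1.0 / (d ** (2.0 / (m - 1.0)))
        u /= u.sum(axis=1, keepdims=True)
        # Centroid update: weighted mean with weights mu_ij^m.
        w = u ** m
        v_new = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        if np.max(np.abs(v_new - v)) < tol:       # 0.001 stopping rule
            v = v_new
            break
        v = v_new
    return u, v

# Two well-separated THOT-like populations (background vs. cloud).
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0.05, 0.01, 300), rng.normal(0.6, 0.05, 300)])
u, v = fcm_1d(x)
print(np.sort(v))
```

On this toy data the two recovered centroids land near 0.05 and 0.6, and each pixel's membership row gives the soft probability used to label cloud versus non-cloud.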