Cloud Detection Using an Ensemble of Pixel-Based Machine Learning Models Incorporating Unsupervised Classification

Abstract: Remote sensing imagery, such as that provided by the United States Geological Survey (USGS) Landsat satellites, has been widely used to study environmental protection, hazard analysis, and urban planning for decades. Clouds are a constant challenge for such imagery and, if not handled correctly, can cause a variety of issues for a wide range of remote sensing analyses. Typically, cloud mask algorithms use the entire image; in this study we present an ensemble of different pixel-based approaches to cloud pixel modeling. Based on four training subsets with a selection of different input features, 12 machine learning models were created. We evaluated these models using the cropped LC8-Biome cloud validation dataset. As a comparison, Fmask was also applied to the cropped scene Biome dataset. One goal of this research is to explore a machine learning modeling approach that uses as small a training data sample as possible but still provides an accurate model. Overall, the model trained on the sample subset (1.3% of the total training samples) that includes unsupervised Self-Organizing Map classification results as an input feature has the best performance. The approach achieves 98.57% overall accuracy, 1.18% cloud omission error, and 0.93% cloud commission error on the 88 cropped test images. Compared to Fmask 4.0, this model improves the accuracy by 10.12% and reduces the cloud omission error by 6.39%. Furthermore, using an additional eight independent validation images that were not sampled in model training, the model trained on the second largest subset with an additional five features has the highest overall accuracy at 86.35%, with 12.48% cloud omission error and 7.96% cloud commission error. This model's overall correctness increased by 3.26%, and the cloud omission error decreased by 1.28%, compared to Fmask 4.0.
The machine learning cloud classification models discussed in this paper achieve very good performance while utilizing only a small portion of the total training pixels available. We showed that a pixel-based cloud classification model can perform well, and that, since each scene has unique spectral characteristics, including a small portion of example pixels from each of the sub-regions in a scene can improve model accuracy significantly.


Introduction
Remote sensing imagery, such as that provided by the United States Geological Survey (USGS) Landsat satellites, has been widely used to study environmental protection, hazard analysis, and urban planning for decades. The usefulness and applications of remote sensing imagery continue to expand as more image-based models and algorithms emerge, allowing us to derive more knowledge and information from satellite images. Due to the ubiquity of clouds, cloud pixels are a persistent presence in such imagery, especially in tropical areas. One study estimates that about 67% of the earth's surface is typically covered by cloud, based on satellite data from July 2002 to April 2015 [1]. The presence of cloud has a serious impact on the use of remote sensing imagery [17]. Zi et al. [18] proposed a novel cloud detection method for Landsat 8 images using Simple Linear Iterative Cluster (SLIC), PCA Network (PCANet), Support Vector Machine (SVM), and Conditional Random Field (CRF) techniques. This approach combines statistical models, classical machine learning methods, and a deep learning network to generate a robust model for cloud detection and achieves accurate results. Yang et al. [19] proposed a CNN-based cloud detection method that uses thumbnails of remote sensing images instead of the original images. To handle the coarse resolution of thumbnail images, a cloud detection neural network with a feature pyramid module and boundary refinement blocks is employed to generate accurate cloud prediction results. This work was further extended by Guo [20] to cloud and snow coexistence scenarios with a new model, DSnetV2.
In addition to cloud detection, deep learning-based techniques have been extensively applied to remote sensing image classification problems, especially for hyper-spectral images. Graph convolutional networks (GCNs) are an emerging network architecture that can handle and model long-range spatial relations. Shahraki and Prasad [21] proposed a cascade framework of 1-D CNNs and GCNs for a hyper-spectral classification problem. Qin et al. [22] extended GCNs by considering spatial and spectral neighbors. Pu et al. [23] proposed a localized graph convolutional filtering-based GCN method for hyper-spectral image classification. Traditional GCNs are computationally expensive because spatial adjacency matrices must be constructed. Hong et al. [24] showed that miniGCNs can be trained in minibatch fashion for classification problems. The miniGCNs are more robust and are capable of handling out-of-sample data at lower computational cost compared to traditional GCNs.
Based on the data required, these cloud detection methods can be divided into single-temporal and multi-temporal approaches. A single-temporal approach seeks to identify cloud pixels based on imagery at a single time, while a multi-temporal approach makes use of imagery from multiple comparable timeframes for the same area, identifying cloud pixels by comparing the pixel differences between cloud-free and cloudy images [25]. Depending on the algorithm used, cloud detection can be categorized into classical algorithm-based approaches and machine learning approaches [26]. Classical algorithms refer to methods that follow specific predefined steps to process the input imagery and generate the output mask, such as Fmask [4]. Machine learning approaches, on the other hand, take advantage of existing data and learn from a training set without human interference to generate an output [18].
Cloud detection remains challenging in several respects. First, cloud pixels are hard to distinguish from "bright" areas such as snow by traditional rule-based approaches. Second, the multi-temporal approach requires more than one image of the same location, which is less applicable for low temporal resolution satellites such as Landsat. Third, while deep learning-based approaches generally enable better cloud detection accuracy, they require a high-performance GPU, which may not be available. Finally, some other approaches require additional information in combination with the spectral features to improve performance, which requires extra labor.

A Pixel-Based Approach Using an Ensemble of Learners
This paper has four distinct goals. (1) Propose a cloud detection modeling approach that uses only the 10 wavelength bands available at a single pixel as its information source, without the need for any other ancillary data. (2) Engineer important predictor features that can increase model accuracy. (3) Investigate the influence of the training sample size on model accuracy, thus finding the smallest possible training sample size that balances training time against machine learning model accuracy. (4) Explore the importance ranking of predictors and the optimal hyper-parameter settings for cloud mask prediction.
In contrast to the whole image-based approaches just described, and to mitigate the challenges just enumerated, in this study we have taken a pixel-based approach that only requires a single image. We have used an ensemble machine learning approach that has been tested with Landsat 8 imagery.
Compared with multi-temporal approaches [25], this single-temporal approach is more feasible because Landsat data are, by nature, mono-temporal. Unlike other classical approaches that need auxiliary data [4], this approach only requires the information in the 10 wavelength bands of a single pixel of Landsat 8 imagery. In comparison to machine learning approaches under a deep learning framework, the computational requirements are far less demanding, so no GPU is needed in this research. The proposed machine learning model uses an ensemble approach that simultaneously employs multiple decision tree learners for cloud detection. The input parameters are tuned in two ways. On the one hand, we optionally include unsupervised classification results from the application of a self-organizing map (SOM) as one of the input features for the ensemble model training. On the other hand, 5 indices calculated from the 10 wavelength band signals are included as input features for model training.
Four distinct training subsets were generated from the Biome cloud validation dataset for model training [6]. Models are generated based on different sets of input parameters and different training samples. Then, their performance is compared against Fmask 4.0.

Data Sources
This study uses the L8 cloud cover assessment validation dataset as the source for model training and validation [6,27]. The L8 Biome dataset includes 96 Landsat 8 images sampled across the world, each with a manually generated cloud mask. The name Biome is given because the target scenes are sampled based on biome types. The path/rows of the scenes covering land are sampled across the world on the basis of biome types, which include urban, forest, shrubland, grass/cropland, snow/ice, wetlands, and water. In order to create a heterogeneous dataset, path/rows with obvious clustering patterns are discarded and re-sampled from their biome types. Scenes within the final set of path/rows are then manually labeled as clear, mid-cloudy, or cloudy based on the cloud coverage percentage within each scene. A total of 96 scenes are prepared, covering 8 biome types at 3 cloud coverage levels, as shown in Figure 1.

Data Pre-Processing: The Intra-Group and the Ultra-Group
Validation masks and all spectral band layers from the Biome dataset are cropped to 4000 × 4000 pixels for model training and validation (Figure 2). These 96 cropped Biome images are then divided into two groups, the intra-group and the ultra-group. The intra-group includes 88 images from which we sub-sample to provide our training pixels. The name "intra-group" rather than "training-group" was chosen because only a tiny portion (up to 1.3%) of the intra-group is used as the model training source, to show the effectiveness of our pixel-based approach. Therefore, more than 98% of the data in the intra-group are "new" to the models, which also qualifies the intra-group for model performance evaluation. The ultra-group, on the other hand, contains 8 images that our models have never seen, meaning the ultra-group is 100% "new" to the models. Images in both groups are used for model performance evaluation. For each biome type, eleven cropped images are selected for the intra-group, and the one remaining image is held out for the ultra-group (see Table 1). The 10 wavelength bands of each scene are then stacked into a TIFF file for model training and validation. However, band 8 has a 15 m resolution, higher than the other bands, so we downsample band 8 to match the 30 m resolution of the other bands.
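As one concrete way to perform this resolution matching, band 8 can be aggregated by 2 × 2 block averaging. This is a minimal numpy sketch; the paper does not specify the resampling method, so block averaging is an assumption here:

```python
import numpy as np

def downsample_2x(band: np.ndarray) -> np.ndarray:
    """Aggregate a 15 m band to 30 m by averaging each 2x2 pixel block."""
    h, w = band.shape
    assert h % 2 == 0 and w % 2 == 0, "dimensions must be even"
    return band.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# A 4x4 toy "band 8" patch becomes a 2x2 patch at half resolution.
b8 = np.arange(16, dtype=float).reshape(4, 4)
b8_coarse = downsample_2x(b8)
print(b8_coarse.shape)  # (2, 2)
```

For a full 8000 × 8000 band 8 crop, the same call yields the 4000 × 4000 grid used by the other bands.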

Ensemble Machine Learning Classification Approach
Ensemble learning is a machine learning paradigm in which multiple weak learners are combined to improve training performance. Various types of learner aggregation can be employed, including Bootstrap Aggregation (Bagging), Boosting, and random subspace. In this paper, all three of these tree ensemble approaches are candidates, and the particular approach employed in the tree-based ensemble model is determined at the Bayesian optimization stage.
Bagging aggregates trained weak learners by creating many bootstrap replicas and training each weak learner on one replica. Typically, the number of bootstrap replicas varies from just a few up to several hundred. Each replica is generated by drawing N samples out of N observations with replacement (see Figure 3a). Drawing samples with replacement omits an average of 37% of the total observations, which decreases the correlation between the weak learners and helps avoid over-fitting. When each weak learner is trained on a single replica with features randomly selected at each split node, the method is called the random forest technique. The ensemble model then makes a prediction on new data by taking the majority vote of the individual learners' predictions for classification.

Boosting trains learners sequentially (see Figure 3b). It maintains a weight distribution over all training observations, with each observation assigned a weight indicating its importance. By decreasing the weights of correctly classified observations and increasing the weights of misclassified observations at each round, later learners are able to focus more on hard observations that previous learners misclassified. Distinct from bagging, which generates replicas for individual learner training, boosting trains all learners on the same dataset but with different weights. Instead of taking the majority vote of the individual learners' predictions, boosting combines the predictions with weights. As a result, boosting makes individual learners focus more on hard observations round by round, thereby obtaining a strong result from weak learners. In this paper, AdaBoost, RUSBoost, and LSBoost are the candidate boosting methods for our ensemble models.
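The bagging scheme described above can be sketched with N-of-N bootstrap replicas over a toy two-class "pixel" set and a majority vote over shallow decision trees. This is illustrative only; scikit-learn, the synthetic Gaussian blobs, and the tree depth are assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy two-class "pixels": two noisy Gaussian blobs standing in for cloud/clear.
X = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(3, 1, (200, 4))])
y = np.array([0] * 200 + [1] * 200)

# Bagging: each weak learner is trained on an N-of-N bootstrap replica;
# on average ~37% of observations are left out of each replica.
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))   # draw N samples with replacement
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

# Prediction = majority vote over the individual learners.
votes = np.stack([t.predict(X) for t in trees])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print((pred == y).mean())  # ensemble accuracy on the toy data
```

Boosting differs only in the aggregation: learners are trained sequentially on reweighted observations and their predictions are combined with weights rather than by majority vote.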

Representative Sampling
There are a total of 88 4000 × 4000 pixel images in the intra-group, a total of 1.408 billion pixels. A small portion of samples is drawn from the intra-group and used for model training. Generally, the more training data we use, the better machine learning model performance we achieve. However, for the task of cloud detection, the cloud label is very labor intensive to generate and is not widely available. One goal of this research is to explore a machine learning modeling approach that uses as small a fraction of the intra-group dataset as possible but can still fit the data well. In order to compare the performance of models trained on different sizes of training subsets, a representative random sampling approach is proposed to sample four subsets from the 88 cropped images. Each of the 88 images has 10 wavelength bands. After stacking the 10 wavelength bands together, each stacked image becomes a 4000 × 4000 × 10 matrix. A validation label is then attached to the stacked matrix, forming a 4000 × 4000 × 11 matrix for each image. These 3-dimensional images are converted to 2-dimensional image tables, which are then concatenated along the horizontal direction to form an image pool table (Figure 4a). The next step is to draw representative samples from this image pool table (Figure 4b). One column of the image pool table is sorted and divided into n bins. These bins have a uniform width covering the range of pixel values in the column, which can reveal the distribution of values in the column. Then m samples are randomly drawn from each bin of the column, and this sampling process is repeated for all columns in the image pool table. Duplicated rows can be drawn when repeating the sampling process across columns, which could lead to an unbalanced sampling subset. Any duplicated samples are removed after the sampling process completes, yielding a sample subset that covers the value range of each column without any duplication.
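A minimal sketch of this representative sampling procedure, assuming the image pool table is a numpy array and using `np.digitize` for the uniform-width binning:

```python
import numpy as np

def representative_sample(pool, n_bins, m_per_bin, seed=0):
    """Draw a representative subset of row indices from an image pool table.

    For each column, the value range is divided into n_bins uniform-width
    bins, and up to m_per_bin rows are drawn at random from each non-empty
    bin; duplicates picked via different columns are removed at the end.
    """
    rng = np.random.default_rng(seed)
    picked = []
    for c in range(pool.shape[1]):
        col = pool[:, c]
        edges = np.linspace(col.min(), col.max(), n_bins + 1)
        # digitize assigns each value to a bin; clip so the max lands in the last bin
        bins = np.clip(np.digitize(col, edges) - 1, 0, n_bins - 1)
        for b in range(n_bins):
            rows = np.flatnonzero(bins == b)
            if rows.size:
                picked.append(rng.choice(rows, min(m_per_bin, rows.size),
                                         replace=False))
    return np.unique(np.concatenate(picked))  # unique, sorted row indices

pool = np.random.default_rng(1).random((10000, 11))  # 10 bands + 1 label column
rows = representative_sample(pool, n_bins=10, m_per_bin=5)
print(len(rows))  # at most 11 * 10 * 5, fewer after de-duplication
```

The returned row indices select the training pixels; changing `n_bins` and `m_per_bin` reproduces the idea of growing or shrinking the training subset.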
The bin number n and the sample number m are adjusted to obtain different training samples from the whole dataset. In this research, 4 training sample subsets are generated for model training (see Table 2).

Figure 4. Image pool table generation (a) and representative sampling (b). B1-B10 and Label in the headers represent the spectral band numbers and the pixel class labels; DN represents the digital value. In the representative sampling procedure (b), the image pool table is sorted on a certain column (in green), then n bins are selected from the sorted column and m pixels are drawn from each bin. A final training sample is generated after repeating this process over all columns.

Table 2. Representative Training Samples. n represents the number of bins, m represents the number of pixels sampled in each bin, and Total represents the total pixels sampled for each subset.

A set of 5 indices is derived from the 10 wavelength bands and used as input features. These include NDSI, NDVI, Whiteness, HOT, and the B5/B6 ratio (see Equations (1)-(4)). These indices are sensitive to cloud pixels and often have a threshold value that is able to differentiate cloud pixels from other pixels [4]. Instead of testing these indices against threshold values, they are integrated as input parameters into the proposed machine learning model.

NDSI and NDVI represent the normalized difference snow index and the normalized difference vegetation index, respectively. Both indices usually have a value near zero for clouds. Sometimes they could be larger for thin cloud over highly vegetated areas but should be less than 0.8.
The Whiteness index was first proposed by Gomez-Chova et al. [28] for Environmental SATellite (ENVISAT) and was modified as in Equation (2) later [4] for Landsat satellite images. The modified whiteness is effective in excluding non-cloud pixels due to their "non-flat" reflectance feature compared to the "flat" reflectance of cloud pixels.
The Haze Optimized Transformation (HOT) index was first proposed by Zhang et al. [29] and was modified by Zhu et al. [4], which is useful to separate haze and thin cloud from clear-sky pixels.
The B5/B6 ratio is able to separate "bright" rocks from cloud pixels, as "bright" rocks usually have a higher Band 6 value than Band 5, while cloud has a higher Band 5 value than Band 6.
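The five indices can be computed directly from the band arrays. The coefficients below follow the Fmask formulation cited as [4]; treat them as one plausible parameterization rather than the paper's exact Equations (1)-(4):

```python
import numpy as np

def cloud_indices(b2, b3, b4, b5, b6):
    """Five cloud-sensitive indices from Landsat 8 TOA reflectance bands.

    Coefficients follow the Fmask formulation [4] and are assumptions here.
    """
    ndsi = (b3 - b6) / (b3 + b6)     # near 0 for cloud, high for snow
    ndvi = (b5 - b4) / (b5 + b4)     # near 0 for cloud, high for vegetation
    mean_vis = (b2 + b3 + b4) / 3.0
    whiteness = (np.abs(b2 - mean_vis) + np.abs(b3 - mean_vis)
                 + np.abs(b4 - mean_vis)) / mean_vis  # "flat" spectra -> small
    hot = b2 - 0.5 * b4 - 0.08       # haze/thin cloud tends to be positive
    ratio = b5 / b6                  # bright-rock vs. cloud separation
    return ndsi, ndvi, whiteness, hot, ratio

# A spectrally "flat" pixel: zero whiteness and near-zero NDSI/NDVI.
out = cloud_indices(*[np.array([0.6])] * 5)
print([float(v[0]) for v in out])
```

In the proposed models these five values are appended to the 10 band values as additional columns of the training table rather than being thresholded.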

Self Organizing Map
A self-organizing map (SOM) is an unsupervised clustering method that uses a neural network to represent training data in a low-dimensional space [30]. The 10-dimensional input vectors are grouped into 100 clusters with 100 neurons. Each neuron has a weight vector, which is updated through a competitive learning approach over the iterations. At each iteration, all neurons are compared with a randomly selected training vector, and the neuron closest in distance to the training point is the winner and is rewarded by updating its weights to move closer to the training point. A neighborhood function is used to update the weights of the neighbors of the winning neuron in order to preserve the topological properties of the input space. Each cluster links to the image pixels sharing similar features. Although the SOM clustering results may not directly differentiate cloud pixels from other pixels, they provide useful information on pixel type that can be utilized as an input feature for the training of the ensemble models. The SOM learning process can be represented by the neuron weight update process defined by Equation (5).
∆W_{j,i}(t) = η(t) · T_{j,I(x)}(t) · (x_i − W_{j,i})    (5)

where ∆W_{j,i} denotes the weight update, t is the epoch, i and j are neuron indices, and I(x) is the winning neuron for input x. η(t) refers to the learning rate, which controls the neuron learning speed. T_{j,I(x)}(t) denotes the neighborhood function, which defines the extent of the neighborhood of a winning neuron; it depends on S_{j,I(x)}, the distance between neuron j and the winning neuron, and on the neighborhood size.
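A compact numpy sketch of this SOM training loop for a 10 × 10 neuron grid, with a Gaussian neighborhood function and decaying learning rate. The toy input data and the schedule constants are assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((2000, 10))             # toy 10-band pixel vectors in [0, 1]

grid = 10                              # 10x10 = 100 neurons, as in the paper
W = rng.random((grid * grid, 10))      # one 10-d weight vector per neuron
coords = np.array([(r, c) for r in range(grid) for c in range(grid)], float)

epochs = 3000
for t in range(epochs):
    eta = 0.5 * (1 - t / epochs)                    # decaying learning rate eta(t)
    sigma = 3.0 * (1 - t / epochs) + 0.5            # shrinking neighborhood size
    x = X[rng.integers(len(X))]                     # randomly selected training vector
    win = np.argmin(((W - x) ** 2).sum(axis=1))     # winning neuron I(x)
    d2 = ((coords - coords[win]) ** 2).sum(axis=1)  # grid distance S_{j,I(x)}^2
    T = np.exp(-d2 / (2 * sigma ** 2))              # Gaussian neighborhood function
    W += eta * T[:, None] * (x - W)                 # weight update, cf. Eq. (5)

# Assign every pixel vector to its nearest neuron: the SOM cluster label.
labels = np.argmin(((X[:, None, :] - W[None]) ** 2).sum(-1), axis=1)
print(labels.shape)  # one of 100 cluster labels per pixel
```

The resulting `labels` column is exactly the kind of unsupervised cluster label that is later appended to the band features for the third model group.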

Model Training
A total of twelve tree-based ensemble models were trained on the four subsets sampled from the 88 cropped Landsat 8 images. Depending on the selection of input features, these machine learning models can be divided into three groups. Each model group consists of four models with the same specifications but trained on the four different sizes of training samples.
The features used to train the first model group were the 10 wavelength bands from the 88 cropped images. For the second model group, the 5 derived indices given in Equations (1)-(4) were included as input features in addition to the 10 wavelength bands. The training process of the third model group consists of two steps. The first step is to build an SOM model on the training sample and use it to classify all pixels in the training sample into clusters. The second step is to include the SOM cluster label as an extra input feature, so that the ensemble learning model is trained with the 10 wavelength bands plus the SOM labels as input features. The target variable for each subset is the pixel class, which consists of cloud, clear, thin cloud, and cloud shadow. The models are summarized in Table 3 and the sample sizes in Table 2.
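The two-step procedure for the third model group can be sketched as follows. For brevity, KMeans with 100 clusters stands in for the 100-neuron SOM, and a random forest stands in for the tree-based ensemble; the synthetic bands and labels are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans            # stand-in for the SOM clusterer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((5000, 10))                    # 10 wavelength bands per pixel
y = (X[:, 4] + 0.3 * X[:, 6] > 0.8).astype(int)   # synthetic cloud/clear label

# Step 1: unsupervised clustering of the training pixels (the paper uses a
# 100-neuron SOM; KMeans with 100 clusters is a simple stand-in here).
cluster = KMeans(n_clusters=100, n_init=3, random_state=0).fit(X)
som_label = cluster.predict(X)

# Step 2: append the cluster label as an 11th input feature and train the
# supervised ensemble on bands + cluster label.
X_aug = np.column_stack([X, som_label])
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_aug, y)
print(model.score(X_aug, y))
```

At prediction time, new pixels are first assigned a cluster label by the fitted clusterer and then passed, with their bands, to the supervised ensemble.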

Hyper-Parameters Optimization
Hyper-parameters are parameters that control the model training process and that can themselves be optimized. Different combinations of hyper-parameters can lead to different model performance. Hyper-parameter optimization refers to the process of finding a set of hyper-parameters that yields an optimal model, i.e., one that minimizes the loss function on the given validation dataset. Grid search, random search, and Bayesian optimization are three of the most common approaches to hyper-parameter optimization. Grid search exhaustively evaluates all combinations from a manually defined hyper-parameter space and is the most expensive approach, especially when the hyper-parameter space is large. Random search is a slight modification of grid search in which hyper-parameter combinations are randomly selected from a manually defined space instead of exhaustively enumerating all combinations. This method is faster than grid search, but both are limited by prior knowledge of the hyper-parameter distribution specifications. Bayesian optimization, on the other hand, attempts to find the global optimum in a minimum number of steps without the need to manually define each hyper-parameter sample point. Bayesian optimization builds a probabilistic model of the objective function over the hyper-parameter values, called the surrogate model. A Gaussian process (GP)-based surrogate model is popular for Bayesian optimization because it is cheap to evaluate. The function used to propose hyper-parameter combinations is referred to as the acquisition function, an important part of the surrogate framework. The Bayesian optimization algorithm can be defined by Equation (6).
X_t = argmax_X u(X | D_{1:t−1})    (6)

where u is the acquisition function and X_t is the hyper-parameter sampling point at iteration t; D_{1:t−1} comprises the objective function values and hyper-parameter samples from the previous t − 1 iterations. Bayesian optimization seeks the hyper-parameter combination at step t that maximizes the acquisition function u. The evaluation result of the objective function at step t is then added to the previous results to update the GP. Hyper-parameters for the tree-based ensemble models are optimized by Bayesian optimization. Tuned parameters include the number of learners, the ensemble approach, the learning rate, the minimum leaf size, the split criterion, the number of variables used at each node, and the number of evaluations.
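A self-contained sketch of GP-based Bayesian optimization in the spirit of Equation (6): an RBF-kernel GP surrogate plus an expected improvement acquisition function, minimizing a stand-in objective over a single hyper-parameter. Everything here (the objective, kernel length scale, candidate grid) is illustrative, not the paper's actual tuning setup:

```python
import math
import numpy as np

def objective(x):                  # stand-in for a cross-validated model loss
    return (x - 2.0) ** 2 + 0.1 * math.sin(5 * x)

def gp_posterior(Xs, ys, Xq, ls=0.7, noise=1e-6):
    """Posterior mean/std of an RBF-kernel Gaussian process (the surrogate)."""
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ls ** 2))
    K = k(Xs, Xs) + noise * np.eye(len(Xs))
    Ks = k(Xs, Xq)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ ys
    var = 1.0 - np.sum(Ks * sol, axis=0)    # prior variance k(x, x) = 1
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sd, best):
    """Acquisition function u: expected improvement for minimization."""
    z = (best - mu) / sd
    Phi = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in z])
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (best - mu) * Phi + sd * phi

Xq = np.linspace(0.0, 5.0, 201)            # candidate hyper-parameter values
Xs = np.array([0.5, 2.5, 4.5])             # initial evaluations D
ys = np.array([objective(x) for x in Xs])
for t in range(10):                        # propose X_t = argmax u(X | D_1:t-1)
    mu, sd = gp_posterior(Xs, ys, Xq)
    x_next = Xq[np.argmax(expected_improvement(mu, sd, ys.min()))]
    Xs = np.append(Xs, x_next)             # evaluate and update the GP data
    ys = np.append(ys, objective(x_next))
print(round(float(Xs[np.argmin(ys)]), 2))  # best hyper-parameter value found
```

The real tuning problem is multi-dimensional and mixed (categorical ensemble method, integer leaf sizes), but the propose-evaluate-update loop is the same.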

Cloud Mask Prediction and Accuracy Evaluation
There are 4 training subsets sampled from the 88 images in the intra-group. As Table 2 shows, each training subset accounts for only a very small portion of the total pixels present in the 88 images. This portion is not set directly but is determined by the bin number n and the sample number m in the representative sampling stage; the portions range from 0.01% to 1.30%. Once the training processes are complete, the models are applied to the 88 images in the intra-group and the 8 images in the ultra-group to generate predicted cloud masks. Fmask 4.0 is also applied to both groups. The masks generated by the Fmask 4.0 algorithm include five classes: clear land, clear water, cloud shadow, snow, and cloud. These classes are grouped into cloud, clear, and shadow to match the output of the machine learning models.
Each of the LC8 Biome images has an associated manually generated mask created by an analyst for validation purposes [6]. Previous cloud masking studies reveal that differences due to analysts' interpretations are about 7% [31]. To avoid such differences, all of these cloud masks were digitized by a single analyst. Image-based confusion matrices are then generated for each model and for Fmask. For each image, 13 confusion matrices are generated: 12 for the 12 model predictions, and the thirteenth for Fmask. These confusion matrices furnish the foundation of the accuracy measure calculations discussed in the next section.

Accuracy Measurements
Confusion matrices over the entire intra-group and ultra-group are generated for each of the machine learning models and for the Fmask algorithm. These confusion matrices are then used to calculate the performance metrics, including the correctness, the cloud commission error, and the cloud omission error, at the image level and the group level. These measurements are defined in Equations (7), where correctness measures the overall model correctness across all three classes, cloud commission error measures the portion of pixels estimated as cloud that are actually not cloud, and cloud omission error measures the portion of cloud pixels that fail to be detected. Because non-cloud pixels labeled as cloud are removed in most remote sensing image analyses, cloud commission error causes information loss. Compared to information loss, a failure to detect cloud pixels has a more severe negative impact on image analysis. Therefore, controlling cloud omission error is the first priority of this research. Among the 88 intra-group images, only 32 images have a shadow class labeled. Limited by the training and validation size, shadow detection is not a goal of this research. Therefore, the commission and omission errors were computed for cloud pixel identification only.
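These three measurements can be computed from a 3 × 3 confusion matrix as sketched below; the class ordering is an assumption, and the commission/omission formulas follow the prose definitions above:

```python
import numpy as np

def cloud_metrics(conf, cloud=0):
    """Correctness and cloud commission/omission errors from a confusion
    matrix with true classes as rows and predicted classes as columns.
    Class order (an assumption here): cloud, clear, shadow."""
    correctness = np.trace(conf) / conf.sum()
    pred_cloud = conf[:, cloud].sum()          # pixels predicted as cloud
    true_cloud = conf[cloud, :].sum()          # pixels that truly are cloud
    commission = (pred_cloud - conf[cloud, cloud]) / pred_cloud
    omission = (true_cloud - conf[cloud, cloud]) / true_cloud
    return correctness, commission, omission

conf = np.array([[90, 8, 2],      # true cloud
                 [5, 80, 5],      # true clear
                 [1, 4, 45]])     # true shadow
print(cloud_metrics(conf))
```

Note that for an image with no predicted (or no true) cloud pixels, the corresponding denominator is zero and the error is undefined, which is the NaN case discussed later for the 100% thin-cloud image.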

Visual Comparison and Accuracy Assessment
For each of the 88 images we produced 14 masks: 12 predicted cloud masks from our 12 machine learning models, one from Fmask, and the ground truth mask. Figure 5 illustrates four sample images with their associated ground truth image, ground truth mask, prediction mask from model 88Mdl_5index_sample4, and Fmask. For these four sample images, both the machine learning model and Fmask perform well on cloud detection overall; however, Fmask shows higher commission and omission errors than the machine learning prediction mask. For the forest image in the first row of Figure 5, both the prediction mask and Fmask show some degree of omission error, but Fmask misses more cloud pixels. For the shrub image in the second row, Fmask mis-classifies a large water zone as cloud. For the grass/crops image in the third row, Fmask again misses some cloud pixels in the thick cloud mixture zone. For the image in the fourth row, both masks provide a good fit based on visual comparison.
To better explore the performance of the machine learning models and Fmask, accuracy assessment results for the four images in Figure 5 are summarized in Table 4. Both the machine learning models and Fmask have good overall correctness (above 85%); however, Fmask has significant omission error in the first and third images, and high commission error in the second image. This quantitative result agrees with the visual comparison. The machine learning model has consistently good performance on the four sample images.
To provide a more comprehensive accuracy assessment, accuracy assessment tables are summarized for both the entire intra-group (Table 5) and the ultra-group (Table 6) for each of the machine learning models and for the Fmask algorithm. Tables 5 and 6 are sorted by cloud omission error in ascending order. From Table 5, 88Mdl_SOM_sample4 performs best on all measurements over the 88 intra-group images. As can be seen in Table 6, 88Mdl_5index_sample3 has the lowest cloud omission error and 88Mdl_sample3 has the best correctness in the ultra-group; Fmask 4.0 has the second lowest cloud omission error.

Figure 5. Sample image comparison between the validation mask, the 88Mdl_5index_sample4 model predicted mask, and Fmask. Each row displays an image of a certain type (Forest, Shrub, Grass/Crops, Urban). The first column shows the ground truth image and the second column the ground truth mask, which was manually generated by an expert. The third column shows the model prediction results and the fourth column the Fmask result.

As can be seen in Table 6, Fmask has the lowest performance for all metrics when compared to the machine learning models. For an in-depth exploration of the factors influencing Fmask performance, the eight images with the lowest performance are summarized in Table 7, and four of the eight images are plotted in Figure 6. The accuracy measurements are summarized at the image level for Fmask and, for comparison, for the five-indices machine learning model trained on sample4. The machine learning model performed well on all images, while Fmask shows high variability across the images; its overall correctness drops to as low as 7.47% for the Ice/Snow images. Four of the eight images are of the ice/snow land type. These images share the common feature of a "bright" background reflectance, which causes trouble for Fmask in two ways: in some images, Fmask mis-classifies ice/snow as cloud, while in others it mis-classifies cloud as clear pixels. This quantitative finding conforms with visual inspection of the results in Figure 6, where Fmask has difficulty detecting cloud in the images in the first and third rows. In the first row, Fmask mis-classifies cloud pixels as snow/ice-covered clear pixels. In the third row, Fmask mis-classifies snow/ice-covered clear pixels as cloud. As can be seen in the fourth row of Figure 6, water is another case with a "bright" background, where Fmask mis-classifies large water regions as cloud pixels.
The other three image types listed in Table 7 are grass/crops, forest, and shrubland. The grass/crops, forest, and shrub images are similar in that they absorb a large fraction of the visible bands and make the images "dark". This effect can influence the performance of Fmask. As can be seen in Table 7, Fmask has high cloud omission error over these three land types. The machine learning model, on the other hand, has a stable performance across the different land types. The cloud commission errors are NaN in the table because the corresponding image is 100% covered by thin cloud; thus, by definition, no cloud commission error can be calculated. For the shrubland image, Fmask mis-classifies large thin cloud zones as clear pixels (second row of Figure 6).

Figure 6. Images mis-classified by Fmask. Each row displays an image of a certain type (Ice/Snow, Shrub, Ice/Snow, Water). The first column shows the ground truth image and the second column the ground truth mask. The third column shows the model prediction results and the fourth column the Fmask result.

Feature Importance and Topology Analysis
Machine learning has traditionally been treated as a "black box", because most machine learning algorithms focus on solving problems based on training data but do not readily offer interpretations of how the input variables influence the prediction outcome. The unsupervised SOM classification method and the supervised tree-based ensemble classification method provide ways to explore these influences of the input variables on the prediction outcomes. As in Figure 7, a 10-by-10 network is constructed, which includes 100 neurons. Each neuron has an associated weight vector of the same dimension as the input vectors. As the training process proceeds, these weight vectors move toward the centers of the input vector clusters. Once the training process completes, the SOM model is applied to the training dataset to explore the variable characteristics. Each neuron represents a cluster, and each input vector is assigned to one of the 100 neurons. A useful feature of SOMs is topology preservation, which means that vectors close together in the input space remain close together after being projected to the 2D plane [32]. Therefore, high-dimensional vectors can be visualized in a 2D plane. Figure 7a illustrates the SOM neuron structure in a 2D gridded plane. As can be seen in Figure 7b, the SOM hits map displays how many input vectors fall within each SOM class. There are several large clusters of vectors, annotated with red circles, which have hit numbers greater than 300,000. Figure 7c indicates the feature-space distance between each node. Black and dark red represent longer inter-class separation distances, while yellow and orange represent shorter ones. Several red and black connections, marked with blue ovals, segment the whole set of input vectors into different zones. Each zone shares some degree of similarity with the adjacent classes. The black regions indicate the longest inter-class separations.
Figure 7d displays the weights of each input spectral band on the neuron plane. Similar patterns in the weight planes indicate high correlation between the corresponding bands. As can be seen, band 1, band 2, band 3, band 4, band 5, and band 8 show a strongly correlated pattern in the 2D topology. Band 6 and band 7 share some degree of similarity. On the other hand, band 9 and band 10 contribute unique information to the SOM model. These plots give insight into how the input variables influence the unsupervised classification and into the similarity between classes. The SOM model is able to integrate these many input variables and provide new insights into the data, which can be used as an additional input feature for the tree-based models to enhance their performance.
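The SOM training procedure described above can be sketched with a minimal NumPy implementation: a 10-by-10 grid of weight vectors, a best-matching-unit search, and a Gaussian neighborhood update that pulls nearby neurons toward each input vector. This is an illustrative sketch on synthetic 10-dimensional vectors, not the actual model or data used in the study; all function names and decay schedules here are assumptions.

```python
import numpy as np

def train_som(data, grid=(10, 10), epochs=3, lr0=0.5, sigma0=3.0, seed=0):
    """Train a minimal rectangular SOM; returns weights of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    dim = data.shape[1]
    w = rng.random((rows, cols, dim))
    # Grid coordinates of each neuron, used by the neighborhood function.
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            t = step / n_steps
            lr = lr0 * (1 - t)              # linearly decaying learning rate
            sigma = sigma0 * (1 - t) + 0.5  # shrinking neighborhood radius
            # Best-matching unit: neuron whose weight vector is closest to x.
            d = np.linalg.norm(w - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighborhood around the BMU on the 2D grid.
            grid_d2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
            h = np.exp(-grid_d2 / (2 * sigma ** 2))[..., None]
            w += lr * h * (x - w)           # pull neighbors toward the input
            step += 1
    return w

def som_class(w, x):
    """Assign an input vector to its best-matching neuron (flattened class id)."""
    return int(np.argmin(np.linalg.norm(w - x, axis=-1)))

# Synthetic stand-ins for 10-band pixel vectors.
pixels = np.random.default_rng(1).random((500, 10))
weights = train_som(pixels, grid=(10, 10))
labels = np.array([som_class(weights, p) for p in pixels])
```

Counting how many pixels land in each of the 100 classes reproduces the idea behind the hits map in Figure 7b, and the per-band slices of `weights` correspond to the weight planes in Figure 7d.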
One advantage of tree-based models is that they are easy to interpret. The decision tree algorithm internally selects, at each splitting node, the variable that most reduces the entropy. The variable that most reduces the entropy is placed at the root node, the second most important variable immediately below it, and so on; variables are thus sorted by their capability to reduce the entropy. As a result, the importance and contribution of each variable to the model can be easily traced and explained, and the importance rank of the input variables in differentiating clusters can be visualized. Figure 8 displays the importance rank plot for the three best-performing ensemble models: the SOM model trained with sample 4, the five-indices model trained on sample 3, and the base model with only 10 wavelength bands trained on sample 3. Among these three models, band 7, band 1, band 5, and HOT have the most influence on the predicted class. For the model incorporating the SOM classification, the SOM class is the fifth most important input feature.
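An importance ranking of this kind can be extracted directly from a fitted tree ensemble. The sketch below uses scikit-learn's impurity-based `feature_importances_` on synthetic data as a stand-in for the per-pixel training table; the feature names and the dependence of the toy label on band 7, band 1, band 5, and HOT are illustrative assumptions, not the paper's actual data or model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Hypothetical feature table: 10 band reflectances plus a HOT index column.
feature_names = [f"band_{i}" for i in range(1, 11)] + ["HOT"]
X = rng.random((2000, 11))
# Make the toy label depend mostly on band_7, band_1, band_5 and HOT so the
# importance ranking is visible in this small example.
y = (0.8 * X[:, 6] + 0.6 * X[:, 0] + 0.5 * X[:, 4] + 0.4 * X[:, 10] > 1.2).astype(int)

clf = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X, y)

# Impurity-based importances, sorted descending as in an importance rank plot.
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda kv: kv[1], reverse=True)
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```

The importances sum to one, so each score can be read as that variable's share of the total entropy (impurity) reduction across all splits in the ensemble.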

Algorithm Complexity and Running Time Analysis
The proposed ensemble method is based on a decision tree model; thus, the algorithm complexity of the decision tree is the conceptual basis for the ensemble complexity analysis. The depth of a fully developed tree is O(log(n)) for a balanced tree and O(n) for an extremely unbalanced tree, where n is the number of training samples. In practice, decision trees are rarely fully developed or extremely unbalanced, so O(log(n)) is a good approximation for the tree depth. A quality function is calculated for every node for each feature at all levels, so the training complexity of a single decision tree is O(m × n × log(n)), where m is the number of features. For Bagging ensembles, random forest techniques are applied, in which only the square root of the number of features is considered at each node of each tree; the training complexity is therefore O(k × √m × n × log(n)), where k is the number of trees in the ensemble. For Boosting ensembles, all features are considered during training, so the complexity is O(k × m × n × log(n)). The choice between Boosting and Bagging is one of the ensemble hyperparameters determined by the hyperparameter optimization. Once the model is trained and ready for use, the run-time complexity for both ensemble types is O(k × m).
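The two training-cost expressions above differ only in the per-split feature count, so their ratio is √m. A small sketch makes this concrete; the function and its operation-count model are illustrative assumptions, not measured costs.

```python
import math

def ensemble_train_cost(n, m, k, method):
    """Rough operation-count model for tree-ensemble training:
    Bagging (random forest) considers ~sqrt(m) features per split,
    Boosting considers all m features."""
    feats = math.sqrt(m) if method == "bagging" else m
    return k * feats * n * math.log2(n)

# Example: 100,000 training pixels, 16 features, 100 trees.
bag = ensemble_train_cost(100_000, 16, 100, "bagging")
boost = ensemble_train_cost(100_000, 16, 100, "boosting")
print(f"boosting/bagging cost ratio: {boost / bag:.1f}")  # sqrt(16) = 4.0
```

For a fixed tree count, Boosting is thus a constant factor √m more expensive to train than Bagging under this model, which is one reason the training times in Table 8 vary with the ensemble type chosen by the optimizer.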
The running times are summarized in Table 8. Training time is the total elapsed time for all the steps needed to generate a model, including representative training subset sampling, model training, hyperparameter optimization, plotting, and model saving, on an Intel(R) Xeon(R) CPU E7-4850 v3 @ 2.20 GHz. Running time is the average elapsed time to make a cloud mask prediction and to plot the prediction masks on an Intel(R) Xeon(R) CPU X5650 @ 2.67 GHz.

Hyperparameter Sensitivity Analysis
Hyperparameters are parameters that control the behavior of empirical machine learning models. Hyperparameter optimization seeks to suppress the variance and bias of a model by finding a set of globally optimal hyperparameters that minimize the objective function value. We applied Bayesian hyperparameter optimization, which minimizes the cross-validation classification error, with 30 iterations for each model. The optimized hyperparameters, as well as the five-fold cross-validation objective function values measuring the model errors, are summarized for each model. The hyperparameters at the initial iteration and at the best iteration are listed in Table 9 to demonstrate how the optimization process improves the objective function value. Each model has two iteration results listed: the first, with iteration value one, is the initial round of hyperparameter optimization, while the second number refers to the iteration at which the best result was achieved among the 30 iterations. The Ensemble column refers to the ensemble method, either Bagging or Boosting. Trees is the number of individual decision trees included in each model. LearnRate is a hyperparameter controlling the learning speed, used by the Boosting method only. MinLeafSize is the minimum number of samples required at a leaf node, and MaxNumSplits is the maximal number of decision splits for each tree; together they control the depth of the tree models. As shown in Table 9, the objective function values decreased significantly after hyperparameter optimization for all 12 ensemble models. Compared to the initial models, the optimized models predominantly employ the Boosting method, with larger MinLeafSize and more MaxNumSplits.
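The structure of this search can be illustrated with a simplified stand-in: candidate configurations over the same hyperparameters (ensemble type, tree count, learning rate, leaf size), each scored by five-fold cross-validation error. This sketch enumerates a tiny fixed candidate list rather than running true Bayesian optimization over 30 iterations, and the synthetic data and configuration values are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((600, 10))                      # toy stand-in for pixel features
y = (X[:, 0] + X[:, 4] > 1.0).astype(int)      # toy cloud/clear label

# Candidate configurations mirroring the tuned hyperparameters; a real run
# would let a Bayesian optimizer propose these over ~30 iterations.
candidates = [
    {"ensemble": "boosting", "trees": 50,  "learn_rate": 0.1, "min_leaf": 5},
    {"ensemble": "boosting", "trees": 100, "learn_rate": 0.3, "min_leaf": 10},
    {"ensemble": "bagging",  "trees": 100, "min_leaf": 5},
]

def build(cfg):
    if cfg["ensemble"] == "boosting":
        return GradientBoostingClassifier(n_estimators=cfg["trees"],
                                          learning_rate=cfg["learn_rate"],
                                          min_samples_leaf=cfg["min_leaf"])
    return RandomForestClassifier(n_estimators=cfg["trees"],
                                  min_samples_leaf=cfg["min_leaf"])

# Objective: five-fold cross-validation classification error.
results = [(1 - cross_val_score(build(c), X, y, cv=5).mean(), c)
           for c in candidates]
best_err, best_cfg = min(results, key=lambda r: r[0])
print(best_cfg["ensemble"], f"error={best_err:.3f}")
```

A Bayesian optimizer improves on this exhaustive scoring by fitting a surrogate model to past (configuration, error) pairs and proposing the next configuration where improvement is most likely.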

Discussion
This paper established and evaluated 12 ensemble machine learning models for pixel-based cloud classification, trained and tested using the labeled Landsat 8 Biome dataset. In addition to the information from the ten Landsat bands, further input features were used, including an unsupervised self-organizing map classification and five indices derived from various band combinations. This feature engineering improved the performance of the machine learning cloud pixel classification.
Typically, training a machine learning model on as much data as possible yields better model performance. However, when it comes to supervised cloud classification, we are often constrained by the availability of labeled images and sometimes by the available computing power. We therefore designed a strategy that uses only a tiny subset of the total number of pixels available for model training.
The objective of this research is three-fold. First, we described, implemented, and validated an ensemble approach to building a supervised classification model for cloud detection, including model selection, selection of representative training data, and feature engineering. Second, we explored the importance of the input features. Last but not least, we investigated how the size of the training subsets influences model performance.
Generally, a larger number of training samples yields better model performance on the intra-group, while training size has less influence on model performance on the ultra-group. In addition, our ensemble models performed consistently on all land types, in contrast to Fmask, which had trouble differentiating cloud pixels in some "bright" and "dark" areas such as snow/ice, shrub, and some forest areas. The fact that the overall model performance on the intra-group is better than on the ultra-group is reasonable, because all the training data are sampled from the intra-group, even though just a small fraction of the total available pixels were used. As shown in Tables 2 and 5, models trained on subsets that accounted for only 1.3% of the total pixels available could achieve an accuracy as high as 98.57% when applied to the whole set of 88 intra-group images. On the independent validation dataset that our machine learning models had never previously seen, performance decreased but still out-performed Fmask.
Landsat 8 images can suffer from various degradations, noise, or variability during image processing. The generalization capability of the machine learning models is largely determined by the representativeness of the training samples. In this study, the LC8 Biome cloud validation data are designed to cover the geographical scope of the world and the eight biome types, and include three levels of cloud coverage (clear, mid, cloudy). This makes the training set representative enough to allow a model to generalize well over the normal radiance variability of LC8 data. However, the LC8 Biome dataset is free from severe noise, such as that caused by equipment failure, which places cloud detection on images with severe noise outside the scope of the proposed machine learning models. Nevertheless, the only well-known such issue so far for LC8 is the thermal band failure. Although the lack of cloud validation masks for severely distorted images prevents us from performing an accuracy assessment, the incorporation of 10 wavelength bands as predictors in the machine learning models provides some degree of resistance to individual band failures. In addition, the proposed machine learning method is also suitable for establishing a cloud detection model for images with high noise, once training samples with severe noise become available.

Conclusions
Overall, the tree-based ensemble machine learning models with Bayesian optimization achieved better performance than Fmask on all metrics over the 88 images in the intra-group. On the eight images in the ultra-group, 88Mdl_5index_sample3 has the lowest cloud omission error and 88Mdl_sample3 has the best correctness. The base model, with the 10 wavelength bands as input variables, achieved good performance on both the intra-group and the ultra-group. Fmask had issues differentiating cloud pixels mainly in ice/snow images, and in some water, forest, and shrub images.
The empirical pixel-based machine learning models discussed in this paper achieved very good and consistent performance using only a tiny portion of the total available training data. This result indicates that a pixel-based method can perform well; because each scene has unique spectral characteristics, including a small portion of example pixels from each of the sub-regions of a scene can improve model accuracy significantly. Integrating the self-organizing map results and the five band-derived indices as input variables further helped to reduce the cloud commission and omission errors compared to the base models. Based on the hyperparameter sensitivity analysis, the Boosting ensemble method with a small MinLeafSize and a large MaxNumSplits is an effective setting for training high-accuracy models. Among all the input variables investigated in this research, band 1, band 5, band 7, and HOT are the most important variables for the best three models; band 10, band 6, and SOM are in the second tier; and band 4, band 6, band 9, NDVI, whiteness, and the B5/B6 ratio are in the third tier, which also contributes to model performance.