Automated Detection of Cloud and Cloud Shadow in Single-Date Landsat Imagery Using Neural Networks and Spatial Post-Processing

The use of Landsat data to answer ecological questions is greatly increased by the effective removal of cloud and cloud shadow from satellite images. We develop a novel algorithm to identify and classify clouds and cloud shadow, SPARCS: Spatial Procedures for Automated Removal of Cloud and Shadow. The method uses a neural network approach to determine cloud, cloud shadow, water, snow/ice and clear sky classification memberships of each pixel in a Landsat scene. It then applies a series of spatial procedures to resolve pixels with ambiguous membership by using information, such as the membership values of neighboring pixels and an estimate of cloud shadow locations from cloud and solar geometry. In a comparison with FMask, a high-quality cloud and cloud shadow classification algorithm currently available, SPARCS performs favorably, with substantially lower omission errors for cloud shadow (8.0% and 3.2%), only slightly higher omission errors for clouds (0.9% and 1.3%, respectively) and fewer errors of commission (2.6% and 0.3%). Additionally, SPARCS provides a measure of uncertainty in its classification that can be exploited by other algorithms that require clear sky pixels. To illustrate this, we present an application that constructs obstruction-free composites of images acquired on different dates in support of a method for vegetation change detection.


Introduction
The Landsat archive provides an unprecedented opportunity to discover how our landscape has changed over the last 30 years.Much of the imagery, however, is contaminated with clouds and their associated shadows, particularly in the tropics and other forested areas with high transpiration [1].Therefore, the usefulness of this imagery for landscape change studies depends on reliably separating clear sky regions from those obstructed by clouds and cloud shadow.Because of the large number of scenes over multiple dates needed for such studies, accurate and reliable automated methods are essential for this task.
Significant work has been devoted to cloud and cloud shadow identification.Many algorithms for cloud and cloud shadow masking have been developed for other sensors, particularly for AVHRR [2][3][4][5] and MODIS [6][7][8].Some of these algorithms have then been adapted for use on Landsat data (see, for example, Oreopoulos et al., 2011 [9]).Since clouds are bright and cold and cloud shadows are darker than the surrounding landscape, a common approach is to apply a threshold to the spectral values [10] or some simple function of two or more spectral values [9,11].An early example is the automatic cloud cover assessment (ACCA) [12], which uses a series of successive thresholds over bands and band combinations to define a hierarchical set of rules for clouds.ACCA was not designed for precise spatial detection of clouds, but rather, to estimate the percentage of cloud cover in a given Landsat scene.These methods operate over imagery acquired on a single date; multi-temporal methods can leverage additional data to detect clouds and their shadows [13,14].Since clouds are typically bright objects in a scene and shadows necessary darken an area, the obstructions can be identified by looking for outliers from a reference scene.In an automated approach, however, where the reference scene must be selected algorithmically, this is only effective when a good cloud detection method for single-date imagery is already in use or when most images are cloud-free and therefore clouds and cloud shadows are outliers from the mean, an assumption that may not hold in areas with frequent cloud cover.
Cloud shadows are more difficult to identify than clouds, because the spectral information does not discriminate between shadows caused by clouds and shadows arising from other causes, such as terrain.Additionally, other dark land covers, such as dark vegetation or water bodies, have similar spectral signatures to shadows.To address this confusion, the cloud mask itself has been used to distinguish cloud shadows, using the known sensor and solar geometry to estimate where cloud shadows should occur given the location of clouds [5,10,11,[15][16][17][18]. Such approaches are sensitive to the height of clouds above the land surface, as this height is proportional to the two-dimensional distance of a cloud from its shadow, as seen in imagery.Cloud height can be estimated using the thermal band [17,18], and combining this with a digital elevation model, as in the algorithm from Huang et al. [17], can further reduce error.
In this paper, we develop a novel method for identifying clouds from Landsat TM and ETM+ imagery: Spatial Procedures for Automated Removal of Cloud and Shadow (SPARCS).This development was motivated by our need for efficient and reliable cloud and cloud shadow masking in a forest change detection application over highly heterogeneous land cover in the eastern U.S.; existing methods were either too computationally intensive or missed many clouds or cloud shadows, which were detected as change.Four design objectives directed our development.First, the method should only use bands present on all sensors and avoid ancillary data sources in order to ensure that the method is applicable over the entire archive.Second, the method should be completely automated and free from operator input.Third, the method should be sufficiently computationally efficient to be applied over thousands of scenes.Finally, the method should provide a spatially-explicit measure of classifier certainty that can be propagated to the products relying on the resultant cloud and cloud shadow masks, a feature we are unaware of in other cloud detection algorithms.
To meet these goals, we use neural network classifiers [19] to explore different methods of using spatial information contained in a single-date Landsat scene to address the cloud and cloud shadow detection problem.These classifiers are trained using scenes with clouds and cloud shadow labeled by human operators at USGS [20] and evaluated using additional manually labeled data.Using this evaluation, we choose a high-quality classifier to become the basis of SPARCS and apply a series of spatial post-processing procedures to resolve ambiguous pixels in the classifier outputs.We then compare SPARCS to a high-quality, commonly used method, FMask [18].Like FMask, we also include a class for water and snow/ice, for completeness.Finally, we demonstrate the utility of our method using an example that exploits the classifier uncertainty provided by SPARCS to combine a multi-temporal image stack into an obstruction-free composite.

Background
Neural networks are non-linear supervised learning algorithms that can be trained to partition an input space into a set of classes.Neural networks work by learning h linear combinations of the input data, where h is determined by the operator, and passing each of these through a given non-linear thresholding function.These results are temporarily stored as hidden values.Then, the network repeats the process by taking c linear combinations of those hidden values, where c is the number of desired classes, and again passing them through a given non-linear thresholding function.These results are then interpreted as the input observation's membership in each output class and are wholly dependent on the weights of the linear combinations at both stages.These memberships are continuous values between zero and one; if desired, a crisp classification can be performed by setting the highest membership value to one and all others to zero.The weights themselves are learned using an optimization procedure over a training dataset that contains observations labeled with their correct class.The correctness of the labels in this training dataset directly controls the quality of the resulting classifier.Additionally, a higher number of hidden values (h) allows more complex patterns in the input data to be learned, though it also increases the likelihood of learning spurious correlations in the training data and thereby reducing the generality of the classifier [19].
Importantly, the input data must include non-ambiguous information about the desired output classes to be able to discriminate between observations.This, however, is not the case when attempting to identify cloud and cloud shadow from aspatial Landsat pixel data over a wide range of land cover types, as the exact same spectral data can be associated with pixels of clouds, snow or some other bright and cold feature, or associated with cloud shadow, terrain shadow, water bodies or some other dark terrain feature.In short, pixel data by itself is ambiguous.As such, an additional source of information is needed to resolve these cases, such as elevation data, which is useful for distinguishing cloud shadow from terrain shadow, observations from multiple time periods to filter ephemeral values or spatial relationships between pixels.We are most interested in harnessing spatial information, because it is already present in the Landsat scene, and because human classifiers can almost always visually identify clouds and cloud shadows within a scene, spatial information should be sufficient for discrimination.In addition to applying spatial adjustments to the classifier output in a post-processing step, we examine two simple methods for incorporating space into neural network inputs.The intuition behind both methods is that, by providing an estimate of "average value" in a region, ambiguity caused by variation within objects could be reduced, creating a simpler problem for the neural network to learn.The first method is to simply calculate the mean spectral value within a neighborhood around each pixel.The second method uses the pixel values from the image after denoising using total variation regularization (TVR) [21].TVR removes noise from an image, f , by finding a new image, u, that minimizes the functional: The first term is the sum of squared-errors between the new image and the original image; minimizing prevents the new image from diverging too far from the original.The second term measures the magnitude of differences between adjacent pixels; minimizing favors outputs with similar adjacent values.Taken together, the method balances smoothing parts of the image with keeping original details.How this balance is struck is determined by α, which should be positive, with smaller values favoring more detail and larger values favoring more smoothing.Due to the nature of the gradient term [21,22], the smoothing manifests as regions of the image with constant values and sudden discontinuous jumps at the edges.This constant value can be thought of as an average of the spectral values within the region.

Data Sets
Manually-generated cloud masks from the USGS LDCM Cloud Cover Assessment Data [20] were subset, refined and used as a training dataset for several different neural network classifiers.The core-cloud and thin-cloud classes, which were separate in the USGS data, were combined into a single cloud class.Of the 157 scenes in the dataset, only the 18 scenes with more than 1% of pixels in both the cloud and cloud shadow classes were considered.From these, twelve representative scenes (Figure 1) from different hemispheres and latitudes were selected for the training dataset; the other six scenes had inadequately accurate cloud and cloud shadow masks for training.To reduce the amount of data used, while retaining land cover variability within scenes, four 1000 × 1000 pixel (30 × 30 km) regions, each separated by at least 2000 (60 km) pixels, were subset from each of these twelve scenes.These 48 sub-scenes were used for training only.
The 48 sub-scene masks were visually assessed for classification label accuracy.The USGS masks provided classes for clear sky, cloud, and cloud shadow.Additionally, water and snow/ice classes were added to the masks.To do so, the Normalized Difference Snow Index (NDSI) [23] and tassel-cap brightness [24] were used to locate potential water and snow/ice regions in the imagery.These proposed masks were then hand-edited to include regions missed by the thresholding and to remove regions incorrectly included.The water and snow/ice masks were then combined with the USGS masks using the follow precedence rules.Pixels labeled as cloud in the USGS mask were unchanged.Pixels labeled clear sky in the USGS masks and flagged as either water or snow/ice were changed to water or snow/ice, as appropriate.Pixels labeled cloud shadow in the USGS masks and flagged as water were hand set appropriately.Cloud shadows over water were labeled as cloud shadow where a distinction could be made.Twelve additional scenes were selected for testing, and one 1000 × 1000 pixel (30 × 30 km) sub-scene was extracted from each.These 12 sub-scenes were then clustered over their spectral data using k-means clustering [25] with 100 cluster seeds, which sorts pixels into 100 groups with similar spectral values.Each resulting cluster was assigned to one of the five classes by an operator.Then, mislabeled image regions were hand-corrected using image editing software.These 12 sub-scenes were used only in classifier assessment, and not during training, in order to reduce the risk of neural network over-fitting during comparisons.All analyses comparing different networks and methods used only these testing scenes.
Landsat 7 ETM+ Level 1T imagery and metadata for all scenes used in training and testing were acquired from the USGS archive.All scenes were from 2001, before the scan line corrector failure.The Landsat ETM+ data in Bands 1-5 and 7 in each sub-scene were corrected to top-of-atmosphere reflectance [26,27] and then further corrected using dark-object subtraction [28].The low-gain scaling of the thermal band (B 6 ) was converted to brightness temperature and then arbitrarily rescaled to values near the other bands to facilitate neural network learning: These values were used in all analyses.

Neural Network Classification
A total of 15 neural network configurations were used to explore the role of classifier complexity and the inclusion of spatial information on classification accuracy (Table 1).Networks with 10, 20 and 30 hidden nodes were constructed using five different types of spatial inputs.In addition, all configurations included the aspatial spectral information (ETM+ Bands 1-5, 7) and rescaled brightness-temperature (ETM+ Band 6) for each individual pixel.The first spatial input type added no spatial information and was used for baseline comparison.The second through fifth spatial input types added spatial information summarized from the first three components of the tassel-cap transformation [24].Spatial input types two and three added the local average in a region around each pixel in the three tassel-cap bands, using a 5 × 5 and 9 × 9 pixel neighborhood, respectively.Spatial input type four and five used tassel-cap pixel values after removing spatial noise using TVR [21,22] with α = 0.05 and α = 0.10, respectively, which remove amounts of detail similar to the local averaging neighborhoods.The five types therefore represent a no-space baseline, plus two spatial-averaging methods, each using two intensities.Training data for the neural networks was randomly sampled from the 48 training sub-scenes after stratifying each sub-scene by class (cloud shadow, cloud, water, snow/ice, clear sky).Where possible, 1500 pixels from each class were selected from each sub-scene.If a sub-scene did not have 1500 pixels of a class, all pixels of that class were selected.The aspatial spectral and brightness-temperature data, the 5 × 5 and 9 × 9 pixel averages in the tassel-cap indices and the two TVR-denoised values of the tassel-cap indices were extracted for each pixel.This process was performed three times, with different stratified random samples selected each time, to generate three sets of training samples.A total of 166,639 samples were used for each network.Each network configuration was trained using each training set, generating a total of 45 networks consisting of the 15 configurations replicated three times.
Networks were trained using scaled conjugate gradient backpropagation by the MATLAB Neural Network Toolbox [29] patternnet() function.

Spatial Post-Processing
Clouds and cloud shadows are spatially-coherent objects in satellite imagery.SPARCS uses information on surrounding pixels to exploit this spatial coherency for the purposes of reducing classification error.In exploratory classifications using several different neural networks, error maps were generated to visually assess the spatial patterns of errors.From these observations, a series of six rules were developed to address spatially-definable error.Each of these rules operates over the continuous-valued membership images for each class.First, a 3×3 median filter is applied to the cloud and cloud shadow membership images to reduce noise.
The second rule addresses confusion at water-land boundaries.Shallow water is often confused for clouds or snow/ice, and the wet soil in the transition zone between water and land is often confused with cloud shadow.To correct this, the cloud, cloud shadow and snow-ice membership of pixels within three pixels of large bodies of water are decreased.
The third rule uses sun and sensor geometry to identify areas of potential cloud shadow using the cloud membership to reduce the significant ambiguity between hill shade, wet ground and cloud shadow.Our approach closely resembles the method of Luo et al. [8] in that it defines a broad area of potential cloud shadow that accounts for a range of potential cloud heights and then combines it with an estimate based on spectral values.Our method uses the cloud shadow membership values for the initial estimate.The direction of the sun is first determined from scene metadata.Then, a copy of the cloud membership image is transposed away from the sun a distance determined by the sun elevation and a cloud height of 2250 m and then expanded (dilated) to include potential cloud heights from 1800 m to 2700 m above the ground, a height range chosen to capture the dark shadows created by the optically thick cumulus clouds.This potential cloud shadow location is then further expanded and blurred using a 15 × 15 pixel filter to create a feasible zone of cloud shadow.Finally, this estimate is multiplied with the cloud shadow membership to increase cloud shadow membership within the feasible zone and to reduce the cloud shadow membership in areas outside of the feasible zone that likely represent terrain shadow erroneously classified as cloud shadow.
The fourth rule addresses confusion between water and deep shadow, which can have equivalent spectral signatures.Pixels that have similar memberships in both cloud shadow and water, meaning that the neural network classifier has identified them as ambiguous, are selected.Those that are also surrounded by pixels of high cloud shadow membership then have their own cloud shadow membership increased and their water membership decreased.
The fifth rule performs a similar function between clouds and snow/ice, which can have similar ambiguity, biasing membership toward clouds and away from snow/ice in pixels surrounded by clouds.
The final rule identifies pixels of high overall uncertainty and uses the membership of nearby pixels to predict the correct membership of the uncertain pixels.First, uncertainty is calculated as the variance between memberships, rescaled to be between 0 and 1.Then, a weighted average of nearby pixels is calculated for each membership class, with weights calculated as the product of each pixel's certainty and a Gaussian decay function over distance with σ = 2 pixels.Finally, new memberships are calculated as a linear combination of the original value and the spatial average, weighted using the pixel's uncertainty, such that more certain pixels retain their original value and uncertain pixels become more like the average value of pixels around them.This rule has the effect of homogenizing areas and removing noise.

Classifier Assessment
Each of the 45 neural networks was scored on each of the 12 evaluation sub-scenes based on the total classifier accuracy.For each pixel, the assigned class from the network was taken as the class with the maximum membership value over all classes.These classes were compared to the evaluation masks described in Section 3.1.Because clouds and cloud shadows are not discrete objects, there are many semi-obstructed pixels that form a transition zone between cloud or cloud shadow and clear sky.Since we are more concerned with identifying potential clouds and cloud shadows than with precisely defining their extent, a three-pixel buffer around the areas labeled as cloud and cloud shadow in the evaluation masks was constructed.Pixels within this buffer were scored as correct if they were labeled as either cloud or cloud shadow, as appropriate, or the label in the evaluation mask.This reduced commission errors and increased the overall accuracy for all methods, including FMask, by approximately 2%.
A multiway ANOVA [30] was performed on the 540 accuracy scores to assess the contributions of classifier complexity and the type of spectral information to accuracy, while accounting for the variations between sub-scenes.Networks with more inputs or more hidden nodes were not penalized for additional model complexity, since the purpose was to find the most effective classifier and not necessarily the most efficient.
After selecting a network to serve as the basis of our method, we then compared our method to FMask using the same classifier accuracy statistics derived from the testing data with buffered masks.We further compared the methods by calculating omission errors as the percentage of cloud or cloud shadow pixels that were mislabeled as clear sky and commission errors as the percentage of clear sky pixels, outside of the buffered area, mislabeled as clouds or cloud shadow.This was done, because FMask has significant confusion between clouds and cloud shadow, and we do not feel that including that confusion is relevant to the core question of whether the classifiers can separate clear sky pixels from obstructions.

Application: Obstruction-Free Summertime Composites
Though SPARCS can provide a crisp classification, wherein each pixel is labeled as exactly one class, by using the raw membership values, neural networks provide a measure of how certain the classifier is in assigning class membership.In this method, we use these original membership values to create clear sky composite images from a multi-temporal stack of Landsat TM scenes acquired within the same year.For the purposes of illustration, we selected four scenes acquired during late summer of 1990 from the same location in eastern Tennessee that had moderate cloudiness on visual inspection.Each of these scenes were classified using SPARCS to generate memberships in cloud, cloud shadow, water, snow/ice and clear sky classes.
For each scene, the pixel memberships in the water and clear sky classes were then combined to generate a clarity index (Q) for each pixel.Additionally, because including a marginally contaminated pixel is more harmful than excluding a marginally clear pixel, this index was squared.
where m W is the water membership and m L is the clear sky membership.For the purposes of this example, snow and ice are considered as obstructions, as snow is seasonal in the area of interest.In high altitude or latitude areas where snow and ice are persistent features, the class should be included.
In order to reduce the influence of phenology, scenes are also weighted by a Gaussian decay function of day-of-year, such that scenes further away from a given date are weighted less than scenes near that date.For this example, we chose a late summer day (day of year 225) and used a standard deviation of 30 days: where d j is the day of year that the j-th scene was acquired.Other days could be chosen, and a comparison between composites weighted to different days could be fruitful to examine phenological effects, as long as care is taken that the decay function itself does not span significant phenological change.
Yearly summertime composites are then generated as a weighted average of all the summertime scenes each year, using the clarity index (Q) and the Gaussian-transformed distance from a target date (w j ) as weights.Each of the seven Landsat bands are computed independently: where A b is the composite image of band b, S is the set of selected scenes to be combined, B b,j is the image data of band b for scene j and Q j and w j are the weights for scene j described above.

Network Selection
We trained 45 neural networks over 48 training sub-scenes and explored the effects of network size and the type of spatial inputs on classification accuracy over 12 evaluation sub-scenes.The inclusion of spatial inputs into the neural network had no statistically significant impact on classifier accuracy over the evaluation dataset.(Table 2).Network size was significant at the 0.05 confidence level; in a post hoc test using Tukey's honestly significant difference (HSD) criterion [31], statistically significant increases in total accuracy were seen in networks with 30 hidden nodes over those with 10 and 20 hidden nodes, which were similar (Figure 2).Absolute gains, though, were small; networks with 30 hidden nodes increased total accuracy by approximately 0.5% or about a 15% decrease in error.The interaction term between network size and sub-scene was also significant, suggesting that the increase in accuracy is due to the ability of more complex networks to learn additional land cover features and that more complex networks are not simply overfitting training data, but become more general classifiers.After post-processing the neural network output with spatial procedures, classification accuracy increased overall, from approximately 94.5% to approximately 97%.The post-processing procedures also exaggerated differences between the methods of including spatial inputs to the neural network, which became statistically significant by the ANOVA F-test (Table 3).Tukey's HSD separated the TVR method with α = 0.05 from the methods that used a mean over a local neighborhood, with the TVR method with α = 0.10 and the method with no spatial inputs being intermediate between the groups (Figure 2).However, the difference between the means of methods to incorporate space within the neural network are within 1% of total classifier accuracy, much less than the increase gained by the inclusion of spatial post-processing procedures.The spatial procedures preserved the patterns in accuracy between networks with different numbers of hidden nodes.Importantly, the application of spatial procedures greatly enhances the effectiveness of methods that have no spatial inputs to the neural network, suggesting that at least part of the classification rules learned by the networks that incorporated space are replicated by the post-processing spatial procedures.Given that calculating the spatial inputs, particularly TVR, is computationally expensive, there is no clear choice for an operational method.We selected a network with 30 hidden nodes and no spatial inputs to the neural network for further evaluation and to use in our cloud and cloud shadow detection package, SPARCS: Spatial Procedures for the Automated Removal of Cloud and Shadow.

Comparison to FMask
SPARCS compares favorably with FMask over the 12 sub-scene evaluation dataset (Table 4).The largest improvement is in correct identification of cloud shadow: SPARCS mislabels 3.2% of cloud shadow pixels as clear sky compared to FMask's 8.0%.Both methods perform well at identifying clouds, with FMask performing somewhat better by mislabeling 0.9% of cloud pixels as clear sky compared to with SPARCS mislabeling 1.3%.Considering errors of commission, SPARCS performs substantially better by mislabeling 0.5% of clear sky pixels as cloud shadow and 0.2% as clouds, compared to FMask mislabeling 2.4% of clear sky pixels as cloud shadow and 2.8% as clouds.
The spatial patterns of error are examined in Figures 3 and 4. A false color image of the scene mapping Bands 5, 4 and 2 to red, green and blue, respectively, is provided for reference (top rows).Classification output for SPARCS (left) and FMask (right) show agreement with the evaluation masks, with commission errors (purple) and omission errors (red) highlighted for clouds (light colors) and cloud shadows (dark colors).Figure 3 is of a scene with sparse mid-altitude clouds in the coastal region of New South Wales, Australia (WRS2 path/row 89/82).Both methods have strong cloud detection, though both miss some small, thin clouds in the northern portion of the image and thin clouds in the southwestern portion, as well as their respective shadows.Because the spectral signal of areas contaminated with thin clouds and their shadows is a mixture of cloud/cloud shadow and the underlying landscape, they are especially difficult to detect, as the resulting signal is ambiguous.Successful methods typically use multi-temporal image stacks and then detect deviations from an average or consensus signal [13].
FMask predicts cloud shadow by projecting the cloud mask onto the land-surface as a function of sun angle and topography without considering spectral information about shaded pixels.When this projection fails, this creates a pattern in the mask where the cloud shadow mask is offset from the actual location of the cloud shadow, as can be seen in several clouds in this image.SPARCS combines information about dark pixels from the neural network output with a similar, but much less precise, cloud projection approach to achieve more overall precision in cloud shadow locations.However, because of shadow/water ambiguity of dark pixels, this approach can causes over-shadowing in dark water near detected clouds, such as that in the eastern portion of the sub-scene.
A scene with dense, discrete clouds and cloud shadows in Hidalgo, Mexico (WRS path/row 26/46), is presented in Figure 4.For both SPARCS and FMask, clouds are detected very well, and most errors occur around the edges of cloud and cloud shadow objects.In SPARCS, some bright land cover in the eastern portion of the image is misidentified as clouds with some spurious associated cloud shadow.Additionally, some dark water is labeled as cloud shadow.FMask exhibits some bright land cover/cloud confusion, as well, though less than SPARCS.Again, the cloud shadow mask consistently fails to extend to the edges of cloud shadow objects, due to misprojection, resulting in substantial missed cloud shadow while simultaneously labeling unshadowed areas as shadowed.

±
Both methods have a halo of commission error around cloud and cloud shadow objects that results from design decisions to reduce contaminated pixels by expanding those masks slightly.This expansion approach creates a trade-off between commission and omission errors, with larger expansions capturing more obstructed pixels by sacrificing nearby clear sky pixels.Much of this halo is ignored due to the three-pixel buffer, described above, but some extends past that buffer and can be seen in the images.The larger halos in FMask represent the method's more aggressive efforts toward this goal.Over all 12 evaluation sub-scenes, SPARCS performs consistently better than FMask (Table 5).In only one sub-scene, which is predominately cloud cover over the Amazon rainforest, is the overall accuracy for SPARCS less than that for Fmask, and then, only by 0.3%.In that scene, SPARCS misses some thin clouds on the edge of the bulk of the cloud mass.FMask and SPARCS perform relatively well or poorly on the same sub-scenes, that is, when SPARCS performs well, so does FMask, and vice versa, suggesting that in some sub-scenes, separating clouds and cloud shadow from clear sky is simply a more difficult problem than in others.In forest disturbance identification, mislabeling cloud and cloud shadow as clear sky (omission) is generally a larger problem than mislabeling clear areas as clouds or cloud shadow (commission), because dark or bright spots can be mistaken for ephemeral disturbance [17].Commission errors become important, however, in areas with frequent cloud cover, such as the tropics or mountainous regions, where actual clear sky views are rare.In these cases, proper identification of clear pixels is necessary to increase the number of observations and maximize the likelihood of comparing images with similar days of the year and phenology.We believe SPARCS provides a balance between these competing objectives.Although SPARCS consistently misses more cloud cover than FMask, it does so at the gain of substantially reducing commission error in otherwise obstructed scenes and, so, increases the likelihood of observing rare clear sky pixels in cloudy areas.

Creating a Multi-Temporal Composite
To illustrate the usefulness of our method, we generated a composite image from Landsat 5 TM images acquired over eastern Tennessee on four dates in the summer of 1990 (Figure 5).Though SPARCS was developed using data from Landsat 7 ETM+, since it does not use the panchromatic band, it can be easily applied to the TM archive.Two of the images in this example have significant cloud cover.One scene, from July 6, is mostly clear sky, though a few cloud and cloud shadow pairs are detected and removed from the composite.A scene from August 16, contains several thin clouds.The scene weights shown are the sum of the clear sky and water memberships from the SPARCS output, multiplied by a deceasing function of distance from August 1, which is used to account for shifts in phenology, as images are acquired further away from the target day.Additionally, the algorithm provides the total of the weights used (Figure 5 bottom right), which can be useful in calculations and analyses further down a processing pipeline to determine, for example, that insufficient data was available for an area during a certain year or that a seemingly anomalous area actually has significant support.
In the scene from 16 August, large portions of the thin cloud are assigned intermediate weights due to classifier uncertainty.By squaring these values, our algorithm trusts these areas substantially less than the same areas in clear images.By providing a continuous measure of uncertainty, though, different algorithms and operators can choose their own thresholds for how conservative they wish to be with data inclusion.The classifier is also consistently uncertain about several areas around Douglas Lake, the body of water in the northern part of the images.However, because the algorithm takes a weighted mean of each image on a pixel-by-pixel basis, only the relative weight through time is important, allowing them to be combined successfully.They are, however, somewhat more susceptible to contamination by clouds or cloud shadow, as the relative difference between the weights of contaminated and uncontaminated pixels is less.

Conclusions
We presented a neural network approach to detect cloud and cloud shadow, as well as water and snow/ice, in Landsat TM and ETM+ imagery: SPARCS (Spatial Procedures for the Automated Removal of Clouds and Shadow).SPARCS uses only single date imagery, does not rely on ancillary datasets and outperforms another high quality method that operates on similar constraints with an overall accuracy of 98.8% compared to 95.3%.Additionally, it is completely automated and does not require specifying new parameters for different scenes, and the classification of a Landsat scene completes in under 5 min on a desktop computer using an AMD Athalon II 3.1 GHz dual-core processor with 8 Gb of RAM, meeting our design goals.
Unlike other cloud and cloud shadow detection algorithms, SPARCS is a fuzzy classifier; crisp classification can be achieved by labeling each pixel as the highest-valued membership class and then using the variance among class memberships as an accompanying measure of uncertainty.Knowing uncertainty allows spatial analyses that utilize SPARCS-generated cloud masks to create more accurate spatial products that have more robust estimates of error.
We explored the inclusion of spatial information as an input to the neural network classifier and found limited support for their inclusion.No method summarizing spatial information increased overall accuracy by more than 0.5%.However, a post-processing stage using expert-defined rules increased accuracy by 3.5%.Of particular usefulness is the rule to differentiate between cloud shadows and terrain shadow that combines predicted cloud shadow locations from solar geometry and cloud locations with neural network output in the cloud shadow and water classes.Inclusions of data from larger spatial areas summarized using total variation regularized denoising (with α > 0.1) or using a log-polar representation of the local neighborhood [32], which has promise in the field of robotic vision, as neural network inputs may be useful avenues for future research.However, these methods are computationally intensive.We believe that a multi-stage method that first classifies several single-date scenes of the same location using a method, such as the one described here, and then using those classifications in a second-stage multi-date classifier to resolve cloud and cloud shadow within each single-date scene is the most promising way forward.

Figure 1 .
Figure 1.WRS2 path/row locations of imagery used in training (black) and testing (red).

Figure 3 .
Figure 3. Image classification of a sub-scene from New South Wales, Australia (WRS2 path/row 89/82), acquired on 21 April 2001, using SPARCS (left) and FMask (right) with confusion between the classifications and evaluation masks highlighted.

Figure 4 .
Figure 4. Image classification of a sub-scene from Hidalgo, Mexico (WRS2 path/row 26/46) acquired on 1 February 2001 using SPARCS (Spatial Procedures for Automated Removal of Cloud and Shadow) (left) and FMask (right) with confusion between the classifications and evaluation masks highlighted

Figure 5 .
Figure 5.A 30×30 km region in eastern Tennessee from four Landsat 5 TM scenes acquired during the summer of 1990 (top row) and their respective weights from zero to one (black to white) from SPARCS (second row), where weights of zero signify contaminated or unusable pixels.The results from compositing (bottom left) and the sum of the weights used to determine each pixel (bottom right) are included.

Table 1 .
All network configurations included aspatial ETM+ bands, but varied in network size (h) and the type of spatial inputs.TVR, total variation regularization.

Table 2 .
Multiway ANOVA of accuracy over the 12 evaluation sub-scenes for all neural networks, without applying post-processing spatial procedures.Mean ranks of different methods for including spatial information in the network (left) and of different numbers of hidden nodes (right), before (top) and after (bottom) applying post-processing spatial procedures.Methods with the same letter are not significantly different by Tukey's honestly significant difference criterion.

Table 3 .
Multiway ANOVA of accuracy over the 12 evaluation sub-scenes for all neural networks after applying spatial post-processing procedures.

Table 4 .
Agreement over all 12 test sub-scenes for SPARCS and FMask compared to the evaluation masks.

Table 5 .
Agreement over all 12 test sub-scenes for SPARCS and FMask compared to the evaluation masks.