The sensors aboard Landsat 8 have been collecting high-quality imagery of the Earth since 2013. Free and open to the public, with global wall-to-wall coverage of land surfaces at an ecologically meaningful spatial resolution of 30 m, Landsat imagery is one of the most useful resources for ecological monitoring and wildland management [1]. However, harnessing the power of the Landsat archive to detect and describe changes on the Earth’s surface hinges on researchers’ ability to detect and aggregate clear-sky observations uncontaminated by clouds and cloud-shadow [3].
Due to this necessity, screening for clouds has always been essential for making full use of Landsat’s spectral imagery. Early algorithms were designed to provide scene-level estimates to be included in metadata, enabling human operators to make informed decisions before purchasing and downloading imagery [7]. Later, more robust algorithms attempted to generate masks of clear-sky views within scenes, using ancillary information in addition to the spectral values in the image, such as elevation, sun angle, and cloud temperature [8] or the spatial relationships of predictions within a scene [9]. Currently, Landsat 8 imagery is accompanied by a bitwise quality mask (BQA) that encodes information about each pixel’s quality and includes masks for clouds, cloud-shadows, and snow/ice. These are generated using the CFMask algorithm, which was shown to be both reliable over a large evaluation dataset and computationally suitable for application over the entire archive [10].
In the last decade, deep convolutional neural networks (CNNs) have revolutionized image recognition [11]. Deep CNNs are fundamentally neural networks with two key modifications. The ‘deep’ designation means that instead of having only one or a few hidden layers, they have dozens, enabling complex features to be constructed from more primitive features learned at early layers of the network. ‘Convolutional’ refers to the network learning sets of two-dimensional convolutional filters that are applied across the image, as opposed to a traditional neural network that treats each input pixel of an image independently. Learning small filters greatly decreases the total number of weights to learn while also allowing flexibility as to where in the image objects are located. Early versions of these algorithms generated either a single conceptual label or a list of labels with associated probabilities for the subject of the scene [12]. These networks have since been used across a variety of remote sensors to classify land covers or create cloud masks by feeding the networks chips from the image and, typically, predicting a single central pixel [16].
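The parameter savings from weight sharing can be illustrated with a minimal NumPy sketch of a single 3 × 3 filter slid across an image; the image and filter sizes here are illustrative, not those of the network described later.

```python
import numpy as np

def conv2d_single(image, kernel):
    """Slide one filter over a 2D image ('valid' padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(32, 32)
kernel = np.random.rand(3, 3)
features = conv2d_single(image, kernel)

# Weight sharing: the same 9 weights cover every location, so a 3x3
# filter needs 9 parameters regardless of image size. A dense layer
# mapping the same 32x32 input to a 30x30 output would need
# (32*32) * (30*30) weights.
conv_params = kernel.size            # 9
dense_params = 32 * 32 * 30 * 30     # 921,600
```

The same nine weights respond to an object wherever it appears in the image, which is the flexibility to object location noted above.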
Recently, CNNs have been applied to the cloud and cloud-shadow detection problem, including state-of-the-art algorithms to detect atmospheric obstructions in Meteosat [21], Sentinel-2 [22], and multiple sensors [23]. Additionally, several deep learning approaches have been developed for Landsat imagery [25], in part due to free access to the high-quality training and evaluation data for these sensors that were used to validate the CFMask algorithm and are freely available from the United States Geological Survey. These data include the Landsat 7 and Landsat 8 Biome Cloud Cover Assessment Validation Data (Biome) as well as the dataset developed for training and evaluating the algorithm described in this paper (SPARCS, or Spatial Procedures for Automated Removal of Cloud and Shadow) [10]. Some of these algorithms also use semantic segmentation approaches and represent important improvements, with classification accuracy rates of approximately 91% on the SPARCS dataset [28], which approaches human accuracy.
Two insights make training semantic segmentation with CNNs functional. First, deconvolution layers enable the network to produce outputs at a higher resolution than their inputs; applied after several pooling steps, they re-expand the receptive field back to the original resolution. Convolution plus pooling ensures that the network learns spatially relevant information about the image set, and deconvolution allows the network to meaningfully apply that information when determining relationships between nearby pixels [29]. Second, if all layers in the network are convolutional (i.e., a fully convolutional network, FCN), then the input size of the image to be classified is constrained only by hardware, rather than being fixed to the size chosen during training [30]. In cases where a typical CNN architecture with an intermediate dense, fully connected layer predicts a central region of the input image, that central region size is fixed, and strided chips of the original image need to be fed into the network and the predictions reassembled. In the most extreme cases, only a single central pixel is predicted from each strided chip. Many of the convolutions from the border around that central region, which inform the central classifications, could also be used to predict pixels neighboring that region, but are instead discarded between strided predictions. This wastes significant computation, since many of the same convolutions are performed multiple times on the same data. In an FCN, the size of the central region is not fixed (though the ‘border’ size is), so one can simply supply more of the original image and receive a larger area of predictions, reducing repeated computation.
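The computational saving can be sketched with simple arithmetic, assuming the 28-px border that this paper's network clips from its output; the subscene size matches the 1000 px × 1000 px subscenes used later.

```python
# With a fully convolutional network, predictions cover the input minus a
# fixed border; 28 px per side is assumed here, matching the clipping
# described for this network.
BORDER = 28

def predicted_region(input_size):
    """Side length of the predicted region for a square input."""
    return input_size - 2 * BORDER

# A chip-based network predicting one central pixel per pass would need
# one forward pass per predicted pixel; an FCN predicts the interior of a
# padded 1000 px subscene in a single pass.
chip_passes = predicted_region(1056) ** 2   # 1,000,000 passes
fcn_passes = 1
```

The same arithmetic gives the 200 px × 200 px training prediction from a 256 px × 256 px window described in the Methods.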
Technological advances in the form of processors specially designed for training neural networks, dubbed tensor processing units (TPUs), enable comparatively rapid training and evaluation of complex network architectures. TPUs reduce training time from weeks (typical when running on CPUs) to hours, and image prediction from hours to the few seconds needed to read and write the data, with negligible processing time [31].
Humans can easily identify clouds and cloud shadows in most single-date Landsat imagery when given spatial context. This insight led to the SPARCS algorithm (Spatial Procedures for Automated Removal of Cloud and Shadow), which was developed in 2014 for the Landsat 4 and 5 Thematic Mapper [9], and here we extend it to Landsat 8. Similar to the original SPARCS, we take a neural network approach, although here we use a many-layer network instead of the original single-layer network. Further diverging, the “Spatial Procedures” are built into the network in the form of convolutional and deconvolutional layers, which the network uses to learn which spatial features are important to the cloud and cloud-shadow identification task.
In this manuscript we present a fully automated algorithm capable of identifying clouds and cloud-shadows in Landsat 8 imagery that produces errors on par with humans while also being sufficiently computationally efficient to feasibly process the entire archive. We describe the neural network architecture of our method and compare the method to the quality bands distributed with Collection 1 Landsat data and an additional third-party dataset of imagery.
2. Materials and Methods
2.1. Training and Evaluation Data
Landsat imagery is captured and organized geographically along the WRS2 path/row system. Eighty unique path/rows were selected by stratifying across each of the 14 World Wildlife Fund terrestrial Major Habitat Types (MHTs) plus ‘Inland Water’ and ‘Rock and Ice’ for each of the seven biogeographical realms (64 scenes; not all habitat types occur in each realm), plus an additional scene from each MHT chosen randomly [32] (Figure 1). For each path/row, a single Landsat 8 image acquired during 2013 through 2014 was selected at random and downloaded from the USGS Earth Explorer. Since data from different areas within a single Landsat image have similar land cover and share atmospheric conditions and acquisition variables such as sun angle, manually classifying an entire scene is a redundant use of classification effort. Instead, a single 1000 px × 1000 px subscene was selected non-randomly from each image to ensure the presence of the habitat type of interest and, where possible, a mix of clouds, clear-sky, and water.
To facilitate manual labeling, false-color images were generated by mapping the shortwave-infrared-1 band (B6) to red, the near-infrared band (B5) to green, and the red band (B4) to blue. A single interpreter labeled each pixel based only on visual interpretation using Photoshop; no thresholding, clustering, or other automated/mathematical approaches were used to assist labeling. This is in contrast to other publicly available datasets, such as the Landsat 8 Biome Cloud Cover Assessment Validation Data (Biome) [10]. Generating the training data in this manner removes many telltale artifacts in label masks—areas that are a speckled mixture of two classes, halos between objects and backgrounds arising from gradients, and small or thin objects omitted due to minimum mapping units—and enables more powerful learning algorithms.
Pixels were labeled as either no-data, clear-sky, cloud, cloud-shadow, shadow-over-water, snow/ice, water, or flood. During training and validation, the shadow-over-water class was combined with the cloud-shadow class. The flood class was re-coded as water or clear-sky, as appropriate for the land-cover type, due to insufficient examples and high spectral variability. Water and snow/ice can be both persistent land-cover types and the result of short-term weather conditions. Including them in masks enables analysts to decide how to treat these conditions for a specific problem while also providing additional information to time-series algorithms in support of automated decision making.
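The class merging described above can be sketched as a simple label remap; the integer codes are hypothetical, since the actual mask encoding is not specified here, and the flood re-coding is omitted because it required a land-cover judgment.

```python
import numpy as np

# Hypothetical integer codes for the manually labeled classes.
NODATA, CLEAR, CLOUD, SHADOW, SHADOW_OVER_WATER, SNOW, WATER = range(7)

def merge_shadow_classes(labels):
    """Fold shadow-over-water into the cloud-shadow class, as done
    before training and validation."""
    out = labels.copy()
    out[out == SHADOW_OVER_WATER] = SHADOW
    return out

labels = np.array([[CLEAR, SHADOW_OVER_WATER],
                   [WATER, SHADOW]])
merged = merge_shadow_classes(labels)
```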
2.2. Neural Network Architecture
The machine learning classifier used for the cloud screening task is a deep, fully convolutional neural network with 20.4 million weights (Figure 2). It predicts six classes of interest: no-data, clear-sky, cloud, cloud-shadow, snow/ice, and water. The classifier can be conceptually separated into three phases: (1) convolution, (2) deconvolution, and (3) output.
The convolution phase is characterized by a series of two-dimensional (2D) convolution layers interrupted by 2 × 2 max pooling layers. Each 2D convolution layer learns N filters of size 3 × 3 × Depth, with N (and thereby the depth of the following convolution) increasing as the max pooling layers decrease the effective resolution. The max pooling layers examine each non-overlapping 2 × 2 window and pass through only the maximum value, reducing resolution. Through this process, spatial information is aggregated across the image, trading resolution for an increasing number of filters describing increasingly complex spatial relationships. The CNN architecture in this stage contains approximately 16.5 million weights, out of approximately 20.5 million weights for the entire network, and is identical to that in VGG-16 [12]. To benefit from transfer learning, weights in analogous layers are initialized to those from VGG-16 and are prevented from updating for the first 10 epochs to encourage later layers to converge toward using the VGG-16 outputs.
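The 2 × 2 max pooling step described above can be sketched in NumPy; a single-channel feature map is assumed for brevity.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling over a 2D feature map
    (height and width assumed even)."""
    h, w = feature_map.shape
    # Group the map into 2x2 blocks, then keep each block's maximum.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1., 2., 5., 0.],
               [3., 4., 1., 1.],
               [0., 0., 2., 2.],
               [9., 1., 3., 8.]])
pooled = max_pool_2x2(fm)   # 4x4 map reduced to 2x2
```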
In the deconvolution phase, the information encoding spatial structure is used to reconstruct the spatial resolution. This phase makes use of 2D deconvolution layers, also referred to as the transpose of 2D convolution. These layers learn filters of size 2 × 2 × Depth that each have four separate outputs arranged as a 2 × 2 window. In this way, the output of each deconvolution layer doubles the resolution of its input. As the resolution increases, the number of filters learned and used for prediction decreases, in an inverse pattern to the convolution phase. After two of these upscaling steps, the moderate-resolution features from Phase 1 are directly added to the deconvolution outputs, allowing the network to use both the spatial information reconstructed from the large-scale features and the more moderate-scale features. The deconvolution phase fully returns the data to its original resolution.
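A minimal NumPy sketch of this stride-2 transposed convolution, with a single channel and an illustrative 2 × 2 kernel, shows how each input value expands into a 2 × 2 output patch and the resolution doubles.

```python
import numpy as np

def deconv_2x2(feature_map, kernel):
    """Transposed convolution with a 2x2 kernel and stride 2: each input
    value contributes a scaled 2x2 patch, doubling resolution."""
    h, w = feature_map.shape
    out = np.zeros((h * 2, w * 2))
    for i in range(h):
        for j in range(w):
            out[2 * i:2 * i + 2, 2 * j:2 * j + 2] += feature_map[i, j] * kernel
    return out

fm = np.array([[1., 2.],
               [3., 4.]])
kernel = np.array([[1., 0.],     # illustrative learned weights
                   [0., 1.]])
up = deconv_2x2(fm, kernel)      # 2x2 input becomes a 4x4 output
```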
The output phase contains a novel feature of our network: data flow splits into two branches to discourage the network from simply relying on fine-scaled features. Both branches predict the same labels, but in the first, no fine-scale spatial features are included, forcing the network to learn useful features for the classification task in the convolution and deconvolution phases. In the second, the early fine-scale features are combined with the output from the deconvolution phase to enable the network to fine-tune spatial structure. The losses from these outputs are combined, with the loss from the first branch weighted twice as strongly as that from the second. Only the second branch, which includes the fine-scale features, is used during prediction. Additionally, 2D spatial dropout [33] is used as a regularization layer. During training, this sets 3/8 of the features to 0, forcing the network to learn redundant patterns that ideally correspond to different avenues of evidence. During prediction, the dropout is omitted to allow all features to contribute to classification.
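The train/predict behavior of 2D spatial dropout can be sketched in NumPy. This version drops whole feature channels and rescales the survivors (inverted dropout, as common implementations do; whether the network here rescaled is an assumption).

```python
import numpy as np

def spatial_dropout(features, rate=3 / 8, training=True, rng=None):
    """2D spatial dropout sketch: during training, zero entire feature
    channels with probability `rate` and rescale the rest; at prediction
    time, pass features through unchanged."""
    if not training:
        return features
    if rng is None:
        rng = np.random.default_rng(0)
    h, w, c = features.shape
    keep = rng.random(c) >= rate            # one keep/drop decision per channel
    return features * keep[None, None, :] / (1 - rate)

feats = np.ones((4, 4, 8))
dropped = spatial_dropout(feats, training=True)
unchanged = spatial_dropout(feats, training=False)
```

Zeroing whole channels, rather than individual pixels, is what forces the redundancy described above: nearby pixels are highly correlated, so pixel-wise dropout would let the network trivially recover the missing values.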
Due to memory constraints during training, a 28-pixel border is clipped from all sides of the output within the network. This could have been resolved by using a smaller window size during training, but since the edges of each input image incorporate many no-data pixels, clipping has the benefit of removing the least informed predictions.
All layers prior to the final prediction layer use a rectified linear unit (ReLU) activation function; the final layer uses a softmax activation to convert activation energies to class probabilities.
Prior to processing, reflectance values in Landsat 8 imagery were corrected to top-of-atmosphere reflectance [34]. All spectral bands were used except the 15 m panchromatic band (B8). The two thermal bands were summed to create a single thermal feature, both to avoid spurious information arising from varied processing around the stray light issue and to facilitate adapting the network to other Landsat sensors, resulting in a total of nine predictive features. Each feature was normalized using the feature-wise mean and standard deviation of the training dataset.
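This feature preparation can be sketched as follows. The band keys, and the choice of B1–B7 plus B9 as the non-thermal inputs, are assumptions consistent with excluding only the panchromatic band and summing the two thermal bands.

```python
import numpy as np

def prepare_features(bands, train_mean, train_std):
    """Build nine predictive features: eight spectral bands plus the two
    thermal bands summed into one, normalized feature-wise using
    training-set statistics."""
    thermal = bands["B10"] + bands["B11"]                  # single thermal feature
    spectral = [bands[k] for k in
                ("B1", "B2", "B3", "B4", "B5", "B6", "B7", "B9")]
    stacked = np.stack(spectral + [thermal], axis=-1)      # H x W x 9
    return (stacked - train_mean) / train_std

# Toy 4 x 4 "bands" with constant values 0..9 for B1..B7, B9, B10, B11.
names = ("B1", "B2", "B3", "B4", "B5", "B6", "B7", "B9", "B10", "B11")
bands = {k: np.full((4, 4), float(i)) for i, k in enumerate(names)}
features = prepare_features(bands, np.zeros(9), np.ones(9))
```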
Six subscenes were set aside for validation and two for testing, leaving 72 for training. These 1000 px × 1000 px subscenes were padded with a 64 px border of no-data to simulate predicting at the edges of Landsat scenes. During training, each requested example was a 256 px × 256 px window randomly clipped from one of these padded subscenes. Each epoch consisted of 1440 samples drawn from all subscenes, and training was allowed to proceed for up to 100 epochs or until performance on the validation sample stopped improving for five epochs, whichever occurred first. In practice, all training runs stopped early due to a lack of validation improvement. Training examples were batched and randomized following best practices [35]. The network predicts a central 200 px × 200 px region of the 256 px × 256 px window during training.
Each training example presented to the network is equivalent to 40,000 single-pixel examples. Since these are all contiguous, stratified sampling based on class is not possible. To partially mitigate this issue, weighted Kullback–Leibler divergence [36] was used as the loss function, with pixels labeled clear-sky given half weight. Clear-sky pixels make up approximately 65% of the total dataset, but are also the most spectrally diverse class. This simple reduction was sufficient to allow the network to reliably discern less frequent classes.
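A minimal NumPy version of such a weighted loss is sketched below; the class ordering is an assumption. With one-hot labels, the KL divergence reduces to categorical cross-entropy, which is what this sketch computes.

```python
import numpy as np

# Assumed class order: no-data, clear-sky, cloud, cloud-shadow, snow/ice,
# water; clear-sky (index 1) receives half weight.
CLASS_WEIGHTS = np.array([1.0, 0.5, 1.0, 1.0, 1.0, 1.0])

def weighted_loss(y_true, y_pred, eps=1e-7):
    """Pixel-wise weighted divergence between one-hot labels and
    predicted class probabilities; with one-hot labels this reduces to
    weighted categorical cross-entropy."""
    pixel_weights = (y_true * CLASS_WEIGHTS).sum(axis=-1)
    cross_entropy = -(y_true * np.log(y_pred + eps)).sum(axis=-1)
    return (pixel_weights * cross_entropy).mean()

# One clear-sky pixel and one cloud pixel, with uniform predictions:
y_true = np.array([[0, 1, 0, 0, 0, 0],
                   [0, 0, 1, 0, 0, 0]], dtype=float)
y_pred = np.full((2, 6), 1 / 6)
loss = weighted_loss(y_true, y_pred)
```

The half weight halves each clear-sky pixel's contribution to the gradient, offsetting the class's roughly 65% share of the pixels.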
The model was fit using the Adam optimizer with default parameters [37].
During prediction, the network produces classifications for a region equal to the input size minus a 28-px buffer along each edge. For this paper, the full 1000 px × 1000 px validation and testing subscenes were no-data buffered by 28 pixels and predicted in a single pass through the network.
Training and evaluation were performed using TensorFlow in Python, with the CNN model specified in Keras. Computation was performed using Google Cloud Services and the TensorFlow Research Cloud. Data were stored in the TFRecord format in a Google Cloud Bucket, program control was executed on Google Compute Engine, and training was accelerated using tensor processing units (TPUs). TFRecord is the format recommended by the TensorFlow Data Input Pipeline best practices manual [35], and provides a way to compactly serialize and store information for efficient retrieval across the Google Compute Engine network. This stack decreased computation time by several orders of magnitude compared to a multi-processor CPU scenario, allowing rapid exploration and experimentation with different architectures and hyperparameters.
The CNN model was evaluated against the two test scenes as well as the six validation scenes. The test scenes were not used during training, whereas the validation scenes were used to monitor progress and determine early stopping. Cohen’s kappa [38], full confusion matrices, and accuracy and recall metrics were calculated between the predicted labels and the manual labels for each subscene. These results were then compared to those calculated between the manual labels and the quality bands included with the Landsat data by the USGS. For the quality band masks, a pixel defaulted to clear-sky but was considered cloud if the cloud flag was set (bit 4), cloud-shadow if the high cloud-shadow confidence bit was set (bit 8), and snow/ice if the high snow/ice confidence bit was set (bit 10). For cloud-shadow and snow/ice, this corresponds to labeling pixels when the CFMask algorithm has medium or high confidence in the condition. CFMask does not distinguish water; these pixels were considered clear-sky during evaluation.
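Collapsing the BQA band in this way can be sketched with bitwise operations; the class codes, and the precedence applied when multiple flags are set, are assumptions for illustration.

```python
import numpy as np

# Illustrative class codes for the evaluation comparison.
CLEAR, CLOUD, SHADOW, SNOW = 0, 1, 2, 3

def decode_bqa(bqa):
    """Collapse a BQA array into evaluation classes: cloud if bit 4 is
    set, cloud-shadow if bit 8 is set, snow/ice if bit 10 is set, else
    clear-sky. Later assignments win, so cloud takes precedence."""
    labels = np.full(bqa.shape, CLEAR, dtype=np.uint8)
    labels[(bqa & (1 << 10)) != 0] = SNOW
    labels[(bqa & (1 << 8)) != 0] = SHADOW
    labels[(bqa & (1 << 4)) != 0] = CLOUD
    return labels

bqa = np.array([0, 1 << 4, 1 << 8, 1 << 10, (1 << 4) | (1 << 8)])
labels = decode_bqa(bqa)
```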
Since clouds and cloud-shadows have fuzzy boundaries, we allowed two pixels of leeway at cloud and cloud-shadow borders within the manually labeled masks, where either of the classes at the boundary was counted as correct. This leeway was applied to masks generated with both SPARCS and CFMask, and ensures that reported errors reflect real failures to detect objects or their full extent, rather than minor disagreement at object edges.
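A simplified sketch of such a lenient comparison accepts a prediction if its class appears in the reference labels within the leeway distance; the paper restricts the leeway to cloud and cloud-shadow borders, whereas this sketch applies it everywhere.

```python
import numpy as np

def lenient_match(pred, truth, leeway=2):
    """Mark a predicted pixel correct if its class appears in the
    reference labels within `leeway` pixels (Chebyshev distance)."""
    h, w = truth.shape
    padded = np.pad(truth, leeway, mode="edge")
    correct = np.zeros((h, w), dtype=bool)
    # Compare the prediction against every shifted copy of the reference.
    for di in range(2 * leeway + 1):
        for dj in range(2 * leeway + 1):
            correct |= pred == padded[di:di + h, dj:dj + w]
    return correct

truth = np.zeros((5, 5), dtype=int)
truth[:, 3:] = 1                    # a vertical cloud edge at column 3
pred = np.zeros((5, 5), dtype=int)
pred[:, 2:] = 1                     # edge predicted one column early
strict_accuracy = (pred == truth).mean()       # penalized at the edge
lenient_accuracy = lenient_match(pred, truth).mean()
```

Here the one-column disagreement along the edge drops strict accuracy to 0.8, while the two-pixel leeway counts every pixel as correct.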
Finally, four subscenes were selected to be manually labeled twice to measure interpreter consistency and the limit of human accuracy. These subscenes were reflected and rotated and then reinterpreted a year after the initial interpretation. The results between these interpretations are combined and presented as a single confusion matrix.
The neural network architecture used here has several design features useful for pixel-wise image segmentation. First, it uses only convolutional operators; therefore, at no time is the 2D spatial arrangement of the input image lost or constrained to a specific size. This allows images of any size greater than the total down-sampling factor (here, 32) to be used during prediction, providing flexibility and convenience, as the same network can be used for whole Landsat scenes, small regions of interest, or arbitrarily sized tiles.
Second, the network is able to provide labels for a large number of pixels per pass—omitting only a small border where estimates lose reliability due to edge effects. Compared to CNN methods that provide only a single central output, this method greatly reduces the number of redundant computations, since many of the same convolutions over the same data are needed when estimating neighboring pixels.
This network is large, with 20.5 million weights; so many weights introduce the risk of overfitting. Here, we attempt to mitigate that risk using transfer learning: the convolutional phase of the network is initialized with VGG-16 weights, and the remainder of the network is coerced into using those filters by not allowing the initialized weights to be updated during the first few epochs. Such a large network is necessary to provide a large receptive field—i.e., the area around each pixel that is able to inform its classification. One way to reduce network size while preserving the receptive field is to use convolutions with larger strides in early stages to quickly decrease resolution, and then omit the later convolution layers with hundreds of filters. However, this also reduces the number of features used to encode spatial structure. Dilated convolutions [39] can enlarge the receptive field using fewer layers while retaining feature density, a strategy used successfully on similar classification problems discriminating only clouds and cloud-shadows from clear-sky pixels [24]. However, early trials in this study with dilated convolutions produced unacceptable outputs with a structured speckle pattern when discriminating between water and shadows from both terrain and clouds; future work may be able to overcome this issue.
For machine learning algorithms to achieve exceptional accuracy, the training data must itself be exceptionally accurate. However, the cloud and cloud-shadow classification task is innately subjective, given that clouds and their shadows have diffuse edges. Our assessment of this subjectivity found more disagreement between images reinterpreted a year apart than between the CNN classifier and the evaluation scenes. Some of the disagreement in reinterpreted images stems from choosing images with a representative range of land-cover and atmospheric conditions rather than a statistically representative sample. However, this does not fully explain the image from path/row 201/033, which was used in both the reinterpreted and the evaluation sets. On this image, the algorithm achieved 98.4% accuracy, whereas the reinterpretation agreed over only 96.2% of the image. This surprising result reflects how the algorithm learned the consistent set of subjective decisions made when the interpreter labeled all of the training imagery, which was performed within a few weeks, whereas the interpreter’s judgement calls had shifted to different preferences after a year. Due to this inconsistency, we believe that the CNN algorithm is at or very near the quality that can be achieved by human interpreters.
Two challenges hamper the evaluation of cloud and cloud-shadow detection methods. First, accuracy metrics are conceptually skewed—a method that performs at 85% accuracy naively sounds good, but produces useless masks. Second, most users are interested in aggressively removing obstructions; those performing automated time-series analyses are advised to dilate the cloud and cloud-shadow masks both to ensure the entire fuzzy object is covered and to remove the image corruption from the thin haze around such objects. This makes errors at boundaries less important than errors where algorithms miss whole clouds or incorrectly label clear-sky regions as clouds: in the first case, dilation is of no help if there is no seed to dilate, and in the second, dilation will subsequently remove large amounts of usable imagery, which could be quite valuable in areas with persistent cloud cover such as the Amazon. Evaluation methods that perform size-weighted object detection are needed for proper assessment.