Weakly Supervised Deep Learning for Segmentation of Remote Sensing Imagery

Abstract: Accurate automated segmentation of remote sensing data could benefit applications from land cover mapping and agricultural monitoring to urban development surveyal and disaster damage assessment. While convolutional neural networks (CNNs) achieve state-of-the-art accuracy when segmenting natural images with huge labeled datasets, their successful translation to remote sensing tasks has been limited by low quantities of ground truth labels, especially fully segmented ones, in the remote sensing domain. In this work, we perform cropland segmentation using two types of labels commonly found in remote sensing datasets that can be considered sources of “weak supervision”: (1) labels comprised of single geotagged points and (2) image-level labels. We demonstrate that (1) a U-Net trained on a single labeled pixel per image and (2) a U-Net image classifier transferred to segmentation can outperform pixel-level algorithms such as logistic regression, support vector machine, and random forest. While the high performance of neural networks is well-established for large datasets, our experiments indicate that U-Nets trained on weak labels outperform baseline methods with as few as 100 labels. Neural networks, therefore, can combine superior classification performance with efficient label usage, and allow pixel-level labels to be obtained from image labels.


Introduction
Automatic pixel-wise classification of remote sensing imagery enables large-scale study of land cover and land use on the Earth's surface, and is relevant to applications ranging from deforestation mapping [1] and development surveyal [2] to ice sheet monitoring [3] and disaster damage assessment [4]. In computer vision, pixel-wise classification is a classic task known as semantic segmentation, and has been tackled with increasing success in recent years due to the development of deep convolutional neural networks (CNNs) [5][6][7][8][9] and large labeled benchmark datasets on which to test architectures [10][11][12][13]. The advantage of CNNs over machine learning methods that take the features of a single pixel as input, such as random forests, support vector machines (SVMs), and logistic regression, is their ability to consider a pixel's context (that is, the pixels near that pixel) in addition to the pixel's own features when performing classification [14]. This context may be helpful when, for example, a pixel of grassland and a pixel of cropland share similar phenological and spectral features, but a wider view of cropland reveals that it is divided into rectangular parcels while grassland is not.
Traditional training of CNNs for segmentation, including the latest methods developed for remote sensing imagery (Section 2), provides the model with an image as input, and computes the loss

1. With sparse pixel labels, we trained a CNN to perform cropland segmentation by masking out all but one pixel in each image on which to compute the loss. We show that randomization of this pixel's location is important for segmentation accuracy across the entire image and ability to use single-pixel classification as a proxy task for segmentation.
2. With image-level labels, we used class activation maps (CAMs) developed by Zhou et al. [21] to extract segmentation predictions from an intermediate CNN layer. The CAMs were converted to predictions via a thresholding algorithm that takes into account the distribution of image-level labels in the dataset.
3. We demonstrate that, while CNNs are already known to outperform other machine learning methods when trained on large datasets with high quality labels, they can also outperform random forest, SVM, and logistic regression models when trained on small numbers of weak labels. It is therefore possible to combine the high performance of deep neural networks with ground truth labels that are easy to obtain. The transfer of image labels to pixel labels also demonstrates a new possibility obtained by moving from established machine learning methods to deep learning.

Figure 1. Examples of (a) Landsat images, (b) their corresponding full segmented labels, and (c,d) two types of weakly supervising labels. (c) Single pixel labels are available from datasets of geotagged points. Gray pixels' labels are not known. (d) Image-level labels provide high-level information about the image but labels of individual pixels are not known. We demonstrate methods to predict the full segmented label, given only one of the weakly supervising labels.

Related Work
The growing body of research that adapts deep neural networks created for natural image segmentation to remotely sensed imagery has largely focused on two areas. The first is creating architectures that allow CNNs and recurrent neural networks (RNNs) to adapt to the unique characteristics of satellite and aerial imagery. These works have successfully demonstrated the use of CNNs and RNNs for land cover classification [22,23], cloud masking [24,25], building footprint segmentation [26][27][28][29], ship segmentation [30], and road segmentation [31,32]. In each case, the ground truth used is densely segmented labels. A second group of pursuits has been to create large datasets, usually of very high resolution imagery, annotated with segmentation ground truth labels [33][34][35][36]. The most recent and largest of these include DeepGlobe 2018 [37] and BigEarthNet [38]. These datasets, especially when paired with competitions on platforms such as Kaggle and CrowdAnalytix, provide much-needed benchmarks and are catalysts for method development.
While these methods show us what is achievable on large, well-labeled datasets, there remains a mismatch between the datasets that are available or easy to collect in many applications and the tools built so far to perform segmentation. First, many deep neural networks contain vast numbers of trainable parameters and require large datasets to achieve high performance, and there are often only small quantities of ground truth labels available for training. To address the small data regime, some recent works have explored the use of transfer learning and semi-supervised learning techniques. Transfer learning makes use of labeled data in a setting similar to the problem of interest, which lacks labels. Kaiser et al. showed that accurate segmentation of a city's buildings and streets could be produced by training a CNN on large-scale, highly noisy labels from OpenStreetMap in other cities [39], while Kemker et al. trained their CNN on synthetic aerial imagery before fine-tuning on real data [40]. Semi-supervised learning boosts performance when large quantities of unlabeled data can be leveraged to augment labeled data. For example, Kang et al. created pseudo-labeled samples using non-deep learning methods to improve deep learning-based segmentation on small labeled datasets [41].
A second source of mismatch, and the one addressed in this work, is that segmentation predictions are often desired in settings where ground truth labels that are available or feasible to collect are at the point or image level. Methods to address this for natural images include work by Hong et al. that used "bridging layers" to share information between separate classification and segmentation networks [42], and work by Pinheiro and Collobert where a CNN trained on image classification contained intermediate input-sized layers that were aggregated to obtain a segmentation prediction [43]. Our approach for coupling image classification and segmentation is similar to the latter in concept, but we combine a U-Net architecture with class activation mapping developed by Zhou et al. [21]. U-Nets were designed for image segmentation on small- to moderate-sized biomedical datasets [7], and class activation maps (CAMs) allow models that are trained for classification tasks to localize class-specific image regions from the target image. To use single-pixel labels for segmentation, we masked out all other pixels of an image when computing the loss, similar to how pixels with the "void" class label are masked out when training on the Pascal VOC segmentation dataset [10].
In cropland mapping, most work to date has performed segmentation using features at an individual pixel level [18][19][20][44,45]. These methods, which include random forests and SVM, have become easy to implement at large scale due to the development of platforms such as Google Earth Engine. However, unlike CNNs, they do not automatically take into account the spatial context of each pixel, which can lend a great deal of information about whether that pixel is cropland or not. Methods have been created to fuse "object-based" features at larger spatial scales with pixel features to improve random forest- and SVM-based cropland maps [45,46], but an advantage of CNNs is that the network learns how to use spatial context to aid segmentation and is not limited by hand-engineering.

Dataset
We describe our study area in the Midwestern United States, the remote sensing dataset used for classification and segmentation, and how we obtained image-level and pixel-level labels.

Study Area
The study area is shown in Figure 2a, spanning from 37°N to 41°30′N and from 94°W to 86°W. It covers an area of over 450,000 km² in the United States Midwest, intersecting the states of Illinois, Iowa, Indiana, Missouri, and Kentucky. We chose this region because the United States Department of Agriculture (USDA) maintains high quality pixel-by-pixel land cover labels across the US, allowing us to evaluate the quality of our segmentation. Furthermore, we applied our methods to a large area to show that they can scale spatially. Land cover-wise, the study region is 44% cropland and 56% non-crop (mostly temperate forest).

Remote Sensing Data
The Landsat Program is a series of Earth-observing satellites jointly managed by the USGS and NASA. Landsat 8 provides moderate-resolution (30 m) satellite imagery in seven surface reflectance bands (ultra blue, blue, green, red, near infrared, shortwave infrared 1, and shortwave infrared 2 [47]) designed to serve a wide range of scientific applications. Images are collected on a 16-day cycle.
Using Google Earth Engine, we found all the Landsat 8 Surface Reflectance Tier 1 images that intersect the study area and were taken between 1 January 2017 and 31 December 2017. We then computed a single composite image from this image collection by taking the median value at each pixel and band. Since Landsat imagery, and satellite imagery more broadly, is affected by different types of contamination, such as clouds, snow, and shadows [48], we used the quality assessment band delivered with the Landsat 8 images to mask out clouds and shadows prior to computing the median composite. The resulting seven-band image spans 4.5 degrees latitude and 8.0 degrees longitude and contains just over 500 million 30-by-30 meter pixels. To prepare the imagery to be input to a CNN, we divided the composite into approximately 200,000 tiles each of dimension 50 × 50 pixels.

Figure 2. The geographic split reduces spatial correlations that may lead to inflated validation and test set accuracies. The non-test set is split into 10 folds for cross-validation; only one fold (darker blue) is visualized here.
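The compositing and tiling steps described above can be illustrated with a minimal NumPy sketch. It is a stand-in for the Google Earth Engine workflow, not a reproduction of it: the function names `median_composite` and `tile`, the array layouts, and the choice to drop partial tiles at the image edges are all assumptions made for illustration.

```python
import numpy as np

def median_composite(stack, clear):
    """Per-pixel, per-band median over cloud-free observations.

    stack: float array (T, C, H, W) of reflectance over T acquisition dates.
    clear: bool array (T, H, W), True where the pixel is free of cloud/shadow.
    Returns (C, H, W); a pixel with no clear observation stays NaN.
    """
    masked = np.where(clear[:, None, :, :], stack, np.nan)
    return np.nanmedian(masked, axis=0)

def tile(image, size=50):
    """Split a (C, H, W) image into non-overlapping (C, size, size) tiles.

    Partial tiles along the right and bottom edges are dropped (an assumption).
    Tiles are returned in row-major order with shape (n_tiles, C, size, size).
    """
    c, h, w = image.shape
    rows, cols = h // size, w // size
    trimmed = image[:, :rows * size, :cols * size]
    blocks = trimmed.reshape(c, rows, size, cols, size).transpose(1, 3, 0, 2, 4)
    return blocks.reshape(rows * cols, c, size, size)
```

As a rough consistency check on the numbers in the text, 500 million pixels divided into 2500-pixel tiles gives about 200,000 tiles.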

Pixel-Level Labels
The Cropland Data Layer (CDL) is a raster geo-referenced land cover map collected by the USDA for the entire continental United States [20]. It is offered at 30 m resolution, so that each Landsat 8 pixel has a corresponding CDL label. CDL includes 132 detailed classes spanning field crops, tree crops, developed areas, forest, and water. In our dataset across the corn belt, we observe 78 CDL classes. The four most common classes (deciduous forest, corn, soybean, and grassland/pasture) account for 85% of the dataset. The remaining classes are each less than 5% of the dataset. For our classification task, we aggregated all crop classes into a single "cropland" class, and non-crop classes into a single "non-cropland" class.
For the remainder of our study, we treat CDL labels as ground truth and use them to evaluate the performance of our methods. The quality of our evaluation therefore depends on the quality of the CDL labels. CDL is created yearly using imagery from Landsat 8 and the Disaster Monitoring Constellation (DMC) satellites, and a decision tree algorithm is trained and validated on ground samples. The accuracy of CDL labels varies by class; for the top classes in our dataset, accuracies detailed in the CDL metadata generally exceed 90% [20]. Since we simplify the CDL labels into {0, 1} for non-crop and cropland, and non-crop and cropland discrimination is easier than crop type discrimination, the binary labels used to supervise our image classification are likely even more accurate than CDL is for individual crop classes.

Image-Level Labels
Since our goal is to evaluate the possibility of generating segmentation labels from a CNN trained on image-level labels, image-level labels are needed for our dataset. For each 50 × 50-pixel tile (each covering 2.25 km²), we computed a binary label ∈ {0, 1} based on whether the majority of pixel-level CDL labels in that tile are crop pixels or not. The label 1 indicates that the tile is "more than 50% cropland" and the label 0 indicates that the tile is "less than 50% cropland". The class balance of the dataset is shown in Table 1. This labeling scheme was chosen because it is quick for humans to assess, and is therefore a realistic label to crowdsource or ask domain experts to generate de novo. We leave for future work the exploration of other labeling thresholds or schemes, such as mere presence of a class of interest.
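The majority rule for image-level labels can be sketched in a few lines of NumPy. The function name `image_label` is hypothetical, and mapping a tile with exactly 50% crop pixels to label 1 is an assumption chosen to match the "≥50%" convention used later in the paper.

```python
import numpy as np

def image_label(pixel_labels):
    """Majority-vote image label from binary pixel labels.

    pixel_labels: array (..., H, W) of 0/1 values (1 = cropland).
    Returns 1 where at least half of the pixels are cropland, else 0.
    A tile that is exactly 50% cropland maps to 1 (assumption).
    """
    crop_fraction = pixel_labels.mean(axis=(-2, -1))
    return (crop_fraction >= 0.5).astype(int)
```

The function accepts a single tile or a batch of tiles, since the mean is taken over the last two axes only.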

Training, Validation, and Test Splits
Since land cover type and characteristics vary smoothly across space, adjacent tiles may contain parts of the same crop field, forest, city, etc. A random split of the 200,000 tiles into training, validation, and test sets will therefore result in a test set performance that overestimates how well the model generalizes to new areas within the study region.
To reduce the performance-inflating effect of spatial auto-correlation, we split the study region into 64 rectangles geographically, and randomly assigned 50 rectangles to a training and validation set and 14 rectangles to a test set. Within the training and validation set, the 50 rectangles were split into 10 folds of 5 rectangles each. One such split is shown in Figure 2b; the other nine are shown in Figure A1 (Appendix A.1). With this geographic train-validation-test split, the test set metrics are an estimate of model performance when applied to a new area within the study region (but not a measure of model generalization to other regions in the US or the world).
To tune machine learning hyperparameters, we trained the model on 9 folds (45 rectangles) and validated on 1 fold (5 rectangles). Using k-fold cross-validation (k = 10) allows us to obtain error estimates when tuning hyperparameters and evaluating the performance of different methods. Note that all splits remained the same across experiments done with different training set sizes; a training set size of 1000 is a sub-sample of the tiles in the corresponding full training set.
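A minimal sketch of the geographic split follows, assuming the 64 rectangles are identified by integer ids 0 to 63 and that each tile inherits the assignment of the rectangle containing it. The function names and the use of a seeded permutation are illustrative assumptions; the text does not specify how the random assignment was implemented.

```python
import numpy as np

def geographic_split(n_rect=64, n_test=14, n_folds=10, seed=0):
    """Randomly assign rectangle ids to a test set and cross-validation folds.

    Returns (test_ids, folds), where folds is a list of n_folds id arrays.
    With the defaults: 14 test rectangles and 10 folds of 5 rectangles each.
    """
    rng = np.random.default_rng(seed)
    ids = rng.permutation(n_rect)
    test_ids = ids[:n_test]
    folds = np.array_split(ids[n_test:], n_folds)
    return test_ids, folds

def train_val_rectangles(folds, val_fold):
    """Hold out one fold for validation and train on the remaining folds."""
    val_ids = folds[val_fold]
    train_ids = np.concatenate(
        [fold for i, fold in enumerate(folds) if i != val_fold]
    )
    return train_ids, val_ids
```

Fixing the seed keeps the splits identical across experiments, in line with the note that splits stayed the same across training set sizes.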

Methods
In this section, we describe the methods used to (1) train a CNN on dense segmentation labels, (2) train a CNN on single-pixel labels for segmentation, and (3) transfer a CNN trained on image classification to the task of segmentation. We also describe data augmentation techniques used to expand our effective dataset size and baseline models such as random forests.
Overviews of the CNN architectures trained with single-pixel labels and image labels are depicted in Figures 3 and 4, respectively.

Data Augmentation
At training time, we employed random rotations and flips to increase our effective dataset size, with the assumption that image-level and segmentation labels are invariant to rotations and flips. We do not employ stretching or rotations that are not a multiple of 90°, since doing so may alter the image-level label of a tile.
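The augmentation can be sketched as follows, assuming square tiles stored as (C, H, W) arrays with (H, W) segmentation labels. The function name `augment` and the particular sampling scheme (one of four rotations, then an optional horizontal flip) are illustrative assumptions; together they cover the eight rotations and reflections of a square.

```python
import numpy as np

def augment(image, label, rng):
    """Apply the same random 90-degree rotation and optional flip to both arrays.

    image: (C, H, W) tile; label: (H, W) segmentation label.
    rng: a numpy.random.Generator.
    """
    k = int(rng.integers(4))                  # number of quarter turns
    image = np.rot90(image, k, axes=(1, 2))   # rotate the spatial axes only
    label = np.rot90(label, k, axes=(0, 1))
    if rng.integers(2):                       # flip left-right half of the time
        image = image[:, :, ::-1]
        label = label[:, ::-1]
    return image.copy(), label.copy()
```

Because these transforms only permute pixels, the fraction of cropland pixels in a tile, and hence its image-level label, is unchanged.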

Convolutional Neural Network Architecture
The deep learning models discussed in this paper share the same core U-Net architecture, illustrated in Figures 3 and 4. The U-Net is a convolutional neural network designed originally for segmenting biomedical imagery [7], intended to perform well with a relatively small number of training images and to yield segmentation at the same resolution as the input image. An input image with C channels, height H, and width W (denoted as dimension C × H × W) is "encoded" by layers in the first half of the network to yield a low dimensional representation. This representation contains high-level information on the image being segmented. The second half of the network then "decodes" the representation back to the original image height and width, with K − 1 channels parameterizing the categorical probability distribution over K classes at each pixel.

Figure 3. A U-Net (a type of CNN) with two down-convolutional blocks and two up-convolutional blocks is shown here. A block is comprised of two convolutional-batch norm layers followed by a max pool or up-convolutional layer. We used masking at the loss computation step to train a U-Net on single pixel labels. The pre-masking output is the network's segmentation prediction.
For our binary cropland versus non-cropland classification task, K is 2, so the U-Net output is of dimension H × W. Since spatial information is lost during encoding, features from the first half of the network are concatenated to those of the second half to re-introduce spatial information and allow for precise segmentation. In our U-Net, the input image is of dimension 7 × 50 × 50, and the output is 48 × 48 due to max-pooling and up-convolutional layers operating in multiples of 2 (Figures 3 and 4). We compared the innermost 48 × 48 pixels in y to the output when computing segmentation accuracy.
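Comparing a 48 × 48 output against the innermost pixels of a 50 × 50 label can be sketched as below. The function names are hypothetical, and the 0.5 probability cutoff used to binarize the network output is an assumption, as the text does not state it for the dense prediction.

```python
import numpy as np

def center_crop(y, out_h=48, out_w=48):
    """Return the innermost out_h x out_w window of the last two axes of y."""
    h, w = y.shape[-2:]
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return y[..., top:top + out_h, left:left + out_w]

def segmentation_accuracy(prob, y, threshold=0.5):
    """Fraction of pixels where the binarized prediction matches the label.

    prob: (48, 48) predicted cropland probabilities from the network.
    y:    (50, 50) binary label; only its innermost window is compared.
    """
    pred = prob >= threshold
    target = center_crop(y, *prob.shape[-2:]).astype(bool)
    return float((pred == target).mean())
```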
Since neural network performance depends on a number of tunable hyperparameters, we used cross-validation with grid search to select the U-Net network depth, number of filters, L2 regularization strength, learning rate, and batch size. The details of hyperparameter search, deep learning frameworks, and hardware used are described in Appendix A.3. For the remainder of this paper, we will show the performance of U-Net and U-CAM models with optimized hyperparameters shown in Table A4. We found that a U-Net with 4 encoding and 4 decoding blocks, with 64 initial filters, L2 regularization of strength 10⁻⁴ to 10⁻³, learning rate of 10⁻³, and batch size of 32 performed the best during cross-validation.
The U-Nets were trained across 20+ epochs; to calculate test set metrics, we chose the epoch with the highest task validation accuracy to evaluate on the test set. This was done for each training set fold to obtain standard errors.
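The epoch selection and error estimate can be sketched as follows, assuming per-fold, per-epoch accuracies have been collected into two arrays. The function name and the standard error formula (sample standard deviation divided by the square root of the number of folds) are assumptions; the text does not specify how standard errors were computed.

```python
import numpy as np

def select_and_summarize(val_acc, test_acc):
    """Pick each fold's best epoch by validation accuracy, then summarize.

    val_acc, test_acc: arrays of shape (n_folds, n_epochs).
    Returns (best_epoch per fold, mean test accuracy, standard error).
    """
    best_epoch = val_acc.argmax(axis=1)
    chosen = test_acc[np.arange(len(best_epoch)), best_epoch]
    std_err = chosen.std(ddof=1) / np.sqrt(len(chosen))
    return best_epoch, float(chosen.mean()), float(std_err)
```

Note that the test accuracies play no role in choosing the epoch; only the validation (task) accuracy does.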

End-to-End Segmentation Using Dense Labels
To get an upper bound on how well U-Nets can perform on cropland segmentation, we performed end-to-end training using dense labels. Training minimized the binary cross entropy loss, defined for each sample as

$$\ell(\hat{y}, y) = -\frac{1}{HW} \sum_{j=1}^{H} \sum_{k=1}^{W} \left[ y_{jk} \log \hat{y}_{jk} + (1 - y_{jk}) \log (1 - \hat{y}_{jk}) \right]$$

for model predictions ŷ of dimension H × W and segmentation label y of dimension H × W.
The notation y_jk denotes the pixel at the (j, k) spatial location of y. Note that ŷ is a function of the input image x and model parameters θ, and each pixel's prediction is a real number in the range (0, 1) representing the probability of that pixel being cropland. For applications with more than 2 classes, the U-Net can easily be adapted to output more prediction classes per pixel, and training would minimize a categorical cross entropy loss.
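A per-sample binary cross entropy over all pixels can be sketched in NumPy as below. Averaging over pixels (rather than summing) and clipping probabilities with a small `eps` for numerical stability are assumptions of this sketch, and `dense_bce` is a hypothetical name.

```python
import numpy as np

def dense_bce(y_hat, y, eps=1e-7):
    """Mean binary cross entropy between predictions and a dense label.

    y_hat: (H, W) predicted cropland probabilities in (0, 1).
    y:     (H, W) binary labels (1 = cropland).
    """
    p = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    per_pixel = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(per_pixel.mean())
```

An uninformative prediction of 0.5 at every pixel gives a loss of log 2, a useful reference point when reading training curves.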

End-to-End Segmentation Using Sparse Labels
Though our dataset has cropland labels at all pixels, we simulated the sparse label setting by sampling one pixel per input image to be the labeled one; all other pixel labels were masked out and not seen by the U-Net. More formally, for each input image, we sampled uniformly at random one of the 2500 pixels in y, whose spatial location we denote (j*, k*), to be the only label in the image whose binary cross entropy is computed in the loss.
For an input sample, the masked U-Net loss can be written

$$\ell_{\text{masked}}(\hat{y}, y) = -\left[ y_{j^* k^*} \log \hat{y}_{j^* k^*} + (1 - y_{j^* k^*}) \log (1 - \hat{y}_{j^* k^*}) \right].$$

We emphasize that, although the position of the labeled pixel is different across tiles, it is fixed for the same tile during every epoch of training and evaluation. On a dataset actually comprised of single-pixel labels, there is no need for random sampling of the label position. Rather, one would sample a satellite image tile that contains the labeled point at a random position.
To obtain dense segmentation predictions from the trained U-Net, we simply skipped the masking step. The unmasked output ŷ from the network was compared against the dense segmentation label y to obtain full segmentation accuracies.
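The masking step can be sketched as a binary mask applied to the per-pixel loss. The function names, the use of a seeded generator to keep each tile's labeled position fixed across epochs, and the `eps` clipping are assumptions of this sketch rather than details from the text.

```python
import numpy as np

def sample_label_positions(n_tiles, height, width, seed=0):
    """Draw one labeled pixel position per tile, uniformly at random.

    Sampling once with a fixed seed keeps each tile's position constant
    across epochs while letting it vary between tiles.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, height, n_tiles), rng.integers(0, width, n_tiles)

def masked_bce(y_hat, y, j, k, eps=1e-7):
    """Binary cross entropy computed at the single labeled pixel (j, k).

    y_hat: (H, W) float predictions; y: (H, W) labels, of which only
    y[j, k] influences the result.
    """
    mask = np.zeros_like(y_hat)
    mask[j, k] = 1.0
    p = np.clip(y_hat, eps, 1 - eps)
    per_pixel = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float((mask * per_pixel).sum() / mask.sum())
```

Predictions at unlabeled pixels contribute nothing to the loss, so any spatial structure they acquire comes from the shared convolutional weights rather than from direct supervision.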

Image Classification
We modified the U-Net to perform image classification by replacing the last 1 × 1 convolution with a global average pooling layer followed by a fully connected layer that outputs a single number ∈ (0, 1). The global average pooling layer computes the mean value across all spatial dimensions of the input and is used to recover a class activation map (Section 4.5.2). A diagram of a 2-layer U-Net modified for image classification is shown in Figure 4a.
The classification task is to detect whether the majority (≥50%) of pixels in an image are in the "cropland" category. Recall that, for our US Midwest dataset, segmentation labels from CDL were converted into binary labels (Section 3.4). To train the model, we used an image-level binary cross entropy loss, which is defined for each input as

$$\ell(\hat{y}, y) = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right],$$

where ŷ is the image-level model prediction and y is the image-level binary label.

Class Activation Maps
To derive segmentation from a network trained for image classification, we used class activation maps (CAMs) following the work of Zhou et al. [21]. CAMs arose from the discovery that intermediate layers of CNNs detect objects despite no supervision on the location of objects at the time of training.
A diagram of how to compute a CAM is shown in Figure 4b; it is the weighted sum of the last convolutional layer's outputs, where the weights are from the fully connected layer. Mathematically, the CAM is defined as

$$\mathrm{CAM}(j, k) = \sum_{c} w_c f_{cjk} + b$$

for the last convolutional layer output f, fully connected layer weights w, and fully connected layer biases b. The sum is over the filter dimension c.
Notice that if f is of dimension C × H × W, then the CAM has dimension H × W. Intuitively, the CAM shows how much each pixel of the last convolutional output was "activated" for cropland. We discuss how this activation is converted to a valid probability in the next section.
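The weighted sum over filters can be sketched with a single tensor contraction. The function names are hypothetical, and adding the bias b to every pixel of the map is an assumption about where the bias enters. Under that assumption, the spatial mean of the CAM equals the logit that the global average pooling and fully connected layers feed to the sigmoid, which the second function makes explicit.

```python
import numpy as np

def class_activation_map(f, w, b=0.0):
    """Weighted sum of feature maps over the filter dimension.

    f: (C, H, W) output of the last convolutional layer.
    w: (C,) fully connected layer weights; b: scalar bias.
    Returns an (H, W) activation map.
    """
    return np.tensordot(w, f, axes=(0, 0)) + b

def classifier_logit(f, w, b=0.0):
    """Logit of the classifier: global average pooling, then a linear layer."""
    pooled = f.mean(axis=(1, 2))   # (C,) spatial mean per filter
    return float(w @ pooled + b)
```

Since both operations are linear, averaging before or after the weighted sum gives the same number, which is why a classifier built this way exposes a spatial map at no extra cost.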

Segmentation Threshold
The values of a CAM can in theory take on any real value, with a higher value indicating a higher activation for cropland in the corresponding pixel. In practice, we observed CAM values falling in the interval [−10, 10]. To convert the CAM to a segmentation prediction that is 0 or 1 at each pixel, we set a single threshold activation value, above which we will predict a value of 1 and below which we will predict a value of 0.
Notice that the threshold value cannot simply be assumed to be 0. Although a value of 0 evaluates to a cropland probability of 0.5 when passed through the last layer of the U-CAM network (sigmoid layer), the sigmoid layer's input is obtained via a weighted sum over filter dimensions and average over spatial dimensions. Since the sigmoid of a sum does not equal the sum of sigmoids, the threshold of 0 does not correspond to a probability of 0.5 at each CAM pixel. In practice, we observed optimal thresholds that deviated from 0, generally within the range [−1, 1].
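A small numeric example makes the point concrete. The 2 × 2 CAM values below are hypothetical, and the example assumes the image-level logit is the spatial mean of the CAM.

```python
import numpy as np

def sigmoid(x):
    """Logistic function mapping real activations to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 2 x 2 CAM: half the pixels strongly activated, half mildly negative.
cam = np.array([[6.0, 6.0], [-2.0, -2.0]])

image_prob = sigmoid(cam.mean())        # sigmoid applied after spatial averaging
mean_pixel_prob = sigmoid(cam).mean()   # average of per-pixel sigmoids
frac_above_zero = (cam >= 0).mean()     # fraction of pixels a zero threshold marks as crop

print(image_prob, mean_pixel_prob, frac_above_zero)
```

Here the image-level probability is roughly 0.88 while the average of the per-pixel sigmoids is roughly 0.56, and a threshold of 0 marks exactly half of the pixels as cropland. The image-level output therefore does not pin down a per-pixel cutoff, which motivates choosing the threshold from data.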
To determine the optimal threshold, we found the threshold that maximizes image-level prediction accuracy on tiles in the training set. This algorithm proceeds as follows. At epoch t,

1. Compute the CAM for each training tile as described in Section 4.5.2.
2. Enumerate a possible set of threshold values V.
3. For each training tile and possible threshold value v ∈ V, let the prediction at pixel (j, k) be s_v(j, k) = 1{CAM(j, k) ≥ v}. That is, if the CAM value is equal to or exceeds the threshold, predict that the pixel is cropland. Compute the image-level prediction ŷ_v from the segmentation prediction s_v in the same way that image-level labels were determined from the segmented ground truth (or human labeling):

$$\hat{y}_v = \mathbb{1}\left\{ \frac{1}{HW} \sum_{j=1}^{H} \sum_{k=1}^{W} s_v(j, k) \geq 0.5 \right\}$$

In other words, an image whose segmented prediction has a majority of pixels (≥50%) predicted to be cropland would be labeled 1; otherwise 0.
4. Find the threshold that maximizes the accuracy of image-level predictions across all training tiles, i.e.,

$$v^* = \arg\max_{v \in V} \sum_{i} \mathbb{1}\left\{ \hat{y}_v^{(i)} = y^{(i)} \right\},$$

where i indexes the training tiles.
5. Return the segmentation prediction s_{v*} for each training and validation image.
We point out that this way of determining a threshold and creating segmented predictions required another loop through the training set, which increased training time. If segmentation labels are available for some tiles in the training set, they can be used to find the threshold instead.
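The threshold search can be sketched as a loop over candidate values. The function names are hypothetical, and breaking ties in favor of the first candidate that reaches the best accuracy is an assumption, since the text does not say how ties are resolved.

```python
import numpy as np

def select_threshold(cams, image_labels, candidates):
    """Pick the CAM threshold that best reproduces the image-level labels.

    cams:         (N, H, W) class activation maps for the training tiles.
    image_labels: (N,) binary image-level labels.
    candidates:   1-D iterable of threshold values to try.
    Returns (best threshold, image-level accuracy at that threshold).
    """
    truth = np.asarray(image_labels).astype(bool)
    best_v, best_acc = None, -1.0
    for v in candidates:
        seg = cams >= v                               # pixel-level prediction
        y_hat = seg.mean(axis=(1, 2)) >= 0.5          # majority rule per tile
        acc = float((y_hat == truth).mean())
        if acc > best_acc:                            # first best candidate wins ties
            best_v, best_acc = float(v), acc
    return best_v, best_acc

def segment(cams, v):
    """Binary segmentation obtained by thresholding the CAMs at v."""
    return (cams >= v).astype(np.uint8)
```

For example, `select_threshold(cams, labels, np.linspace(-1, 1, 41))` would search a grid over the range in which optimal thresholds were reported to fall; the grid spacing is an arbitrary choice.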

Baseline Models
We compared the masked U-Net and U-CAM methods against a few commonly used machine learning baselines: logistic regression, support vector machines (SVM), and random forests. All have been used in the field of remote sensing to classify land cover. For each method, we optimized over its hyperparameters across dataset sizes to provide the highest performing baseline possible; the best hyperparameters are shown in Tables A1-A3. Descriptions of the three baselines and the hardware we used to run them can be found in Appendix A.2.
The same training, validation, and test set splits were used for the baseline models as for the deep neural networks. For comparison against the masked U-Net, the center pixel of each image and its label were provided to the baseline models for training and validation. For comparison against the U-CAM model, the same image-level label ("more than 50% cropland" or "less than 50% cropland") was used to label all pixels in the image (50 × 50 = 2500 pixels) in the training set fed to baselines. In other words, all pixels in an image labeled "more than 50% cropland" are labeled as "cropland" and all pixels in a "less than 50% cropland" image are labeled as "non-cropland". Evaluation on the validation and test sets in this setting was, however, still performed using pixel-level labels.
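The two baseline training sets can be sketched as array reshapes. The function names are hypothetical, and taking index (H // 2, W // 2) as "the center pixel" of an even-sized tile is an assumption.

```python
import numpy as np

def pixels_with_image_labels(tiles, image_labels):
    """Flatten tiles to pixels, each pixel inheriting its tile's image label.

    tiles: (N, C, H, W); image_labels: (N,).
    Returns X of shape (N*H*W, C) and y of shape (N*H*W,).
    """
    n, c, h, w = tiles.shape
    X = tiles.transpose(0, 2, 3, 1).reshape(n * h * w, c)
    y = np.repeat(np.asarray(image_labels), h * w)
    return X, y

def center_pixels(tiles, seg_labels):
    """One labeled pixel per tile, taken at the tile center.

    tiles: (N, C, H, W); seg_labels: (N, H, W).
    Returns X of shape (N, C) and y of shape (N,).
    """
    n, c, h, w = tiles.shape
    return tiles[:, :, h // 2, w // 2], seg_labels[:, h // 2, w // 2]
```

The first construction makes the label noise explicit: in a tile that is 60% cropland, the remaining 40% of pixels enter training with the wrong label.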

Results
Here we summarize the results of the (1) U-Net trained on dense segmentation labels, (2) masked U-Net trained on single-pixel labels, and (3) U-CAM transferring image classification to segmentation. For a description of baseline model results, see Appendix A.5. Figure 5a shows fully supervised U-Net loss and segmentation accuracy across 20 epochs of training, averaged for the 10 training folds. Early in training, cross-entropy loss decreases rapidly and segmentation accuracy increases steeply. The model begins to perform well on the training set after only one epoch of training, while performance on the validation set slowly improves after more epochs (around 10 for n = 200). We viewed the fully supervised U-Net as an oracle that provides an upper bound on how well we can expect the U-Net architecture to perform at segmentation given the best possible labels. Note in Figure 6 that, at 100,000 training samples, U-Net test set accuracy reaches 92%, which is approaching the accuracy of CDL, our ground truth.

Obtaining Segmentation from Sparse Pixel Labels
When using one labeled pixel per image to supervise a U-Net for segmentation, we first observe that decreasing cross-entropy loss and increasing task accuracy (single-pixel classification) corresponds to increasing segmentation accuracy as the model trains (Figure 5b). The closer the correspondence between task accuracy and segmentation accuracy, the more we can use the task accuracy to select the best training epoch for segmentation and be confident in the implied segmentation accuracy.
We trained the masked U-Net model on tiles with either randomized or constant label positions, and found that randomness in the position of the labeled pixel across tiles was important for avoiding overfitting and achieving high correlation between task accuracy and segmentation accuracy (Figure A2 and Appendix A.4). Because the task-segmentation correlation in the case of randomized label positions is close to 1.0 on the validation set (Figure 7a), a model with high task validation accuracy is nearly guaranteed to also yield a high segmentation accuracy.
Figure 6a compares the test set accuracy of the masked U-Net against baseline and oracle methods across training set sizes from 10 to 100,000. Because our label classes are fairly balanced (Table 1), we primarily report our findings using the accuracy metric; the findings are similar for precision, recall, and F1-score metrics, shown in Table 2 for n = 100 and n = 1000. At training sizes below n = 100, the masked U-Net has lower accuracies for cropland classification than all three baselines. This suggests that it is difficult to learn the large number of parameters in the U-Net well with under 100 labeled pixels. At training sets larger than n = 100, however, the advantage of seeing a pixel's context, even without the labels of the surrounding pixels, allows the masked U-Net to outperform the baselines. At n = 1000, the masked U-Net achieves a segmentation accuracy of 0.88, compared to SVM at 0.85, random forest at 0.84, and logistic regression at 0.81. The masked U-Net accuracy continues to increase with training size and approaches the performance of the U-Net upper bound; even at n = 100,000 the model still benefits from more training samples. Examples of the masked U-Net's segmentation predictions on the test set are shown in Figure 8, along with the random forest predictions on the same images.
Across all samples, the masked U-Net produces predictions that are more spatially coherent (i.e., neighboring pixels are more correlated in label) than the random forest predictions. The U-Net also notably does not classify urban vegetation as cropland where the random forest does, illustrating the utility of seeing a pixel in its context.

Obtaining Segmentation from Image Classification
Figure 5c shows the training and validation set performance of the U-CAM model across epochs. As loss on the image classification task decreases and accuracy increases, segmentation accuracy increases as well, despite the model never seeing any pixel labels. The correlation between image classification accuracy and segmentation accuracy is 0.91 on the validation set (Figure 7b), indicating that models that perform well on image classification generally perform well on segmentation as well, but there are outliers. This strong but incomplete correspondence between the two tasks suggests that locating the cropland pixels in an image is one way the model can tell whether an image is majority cropland, but it is not the only way. The presence of certain features (for example, a few densely clustered buildings) may be a strong enough signal for the model to classify an image as non-cropland or cropland without looking at the other parts of the image.
Nevertheless, using image classification accuracy on the validation set to select the model for segmentation led us to choose U-CAM models that outperform the baselines and achieve segmentation accuracies exceeding 85% on the test set (Figure 6b). Our baseline machine learning methods are not designed to extract pixel-level information from image labels, so we modified their input data to be individual pixels labeled with the image label. Image labels add significant noise to pixel-level training, and the accuracies of the random forest and logistic regression baselines are 4-6% lower than their counterparts trained on pixel labels. The U-CAM model performs better than the image-level baselines in segmentation accuracy at all dataset sizes, and also performs better than the baselines trained on pixel labels at n ≥ 100. At n = 1000, the U-CAM method achieves a segmentation accuracy of 0.86, compared to random forest at 0.79 and logistic regression at 0.77. Results for precision, recall, and F1-score metrics are shown in Table 3 for n = 100 and n = 1000; we observe that, while accuracy and precision of the U-CAM model are comparable to those of the masked U-Net, recall is significantly lower. Figure 8 shows examples of U-CAM segmentation predictions for the test set and the corresponding cropland activation maps extracted from the network. Like the masked U-Net predictions, the U-CAM segmentation is more spatially coherent than the random forest segmentation, which is very noisy due to the many incorrect training labels.

Weakly Supervised Segmentation
The methods assessed in this paper show that CNNs can be trained for segmentation using small datasets composed of pixel or image labels. The masked U-Net and U-CAM models can achieve segmentation accuracies of over 85% with modest dataset sizes in the hundreds of labels, outperforming commonly used pixel-based methods like logistic regression, SVM, and random forest. These simple modifications allow the advantages of CNNs, namely their ability to account for spatial context and learn nonlinear transformations, to be combined with datasets that domain experts or crowdsourced workers can feasibly generate in quantity.
The selection of the best U-Net for segmentation using only weak labels requires high performance on the weakly supervising task to correspond to high segmentation performance. We showed that this is true when the position of labeled pixels is random across the training set ($R^2 \approx 1.0$), while the relationship is not as strong when the labeled pixel is always in the center of the image ($R^2 = 0.79$). Under random labeling, the model can perform well on the task either by (1) classifying all pixels in each image correctly or (2) memorizing the locations of the labels in all training tiles and classifying those pixels correctly. The near-perfect correlation between task and segmentation accuracy on the validation set indicates that the U-Net accomplished the former. In contrast, when the center pixel is always the labeled one, the model does not have to correctly classify the other pixels in the image. This indicates that, if one is given a dataset of point labels, a random crop of remote sensing imagery should be extracted around each point, rather than tiles with the label always at a fixed position.
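The recommendation to extract a random crop around each point label can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `random_crop_around_point`, the pixel-coordinate inputs, and the square tile are hypothetical and are not taken from the pipeline used in this work.

```python
import random


def random_crop_around_point(point_row, point_col, scene_height, scene_width,
                             tile_size, rng=random):
    """Choose a tile that contains the labeled point at a random position.

    Returns (top, left, label_row, label_col): the tile's top-left corner in
    scene coordinates and the label's position inside the tile.
    Hypothetical helper; all names are illustrative.
    """
    if not (0 <= point_row < scene_height and 0 <= point_col < scene_width):
        raise ValueError("labeled point lies outside the scene")
    if tile_size > min(scene_height, scene_width):
        raise ValueError("tile does not fit inside the scene")

    # Valid offsets keep the point inside the tile AND the tile inside the scene.
    top_min = max(0, point_row - tile_size + 1)
    top_max = min(point_row, scene_height - tile_size)
    left_min = max(0, point_col - tile_size + 1)
    left_max = min(point_col, scene_width - tile_size)

    top = rng.randint(top_min, top_max)      # randint is inclusive on both ends
    left = rng.randint(left_min, left_max)
    return top, left, point_row - top, point_col - left
```

Because the offset is redrawn for every point, the labeled pixel lands at a different tile position each time, so the model cannot rely on a fixed location and is pushed toward classifying every pixel correctly.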
Meanwhile, performance on the image classification task has a correlation with segmentation performance of $R^2 = 0.91$. We hypothesize that this is because the global average pooling layer encourages the U-CAM model to perform classification via segmentation. In other words, the model performs well on image classification if its pixel-level predictions are correct on average. Furthermore, the skip connections of the U-Net enable spatial information from the input image to be kept and used to localize pixel labels. Ultimately, this high correlation makes it possible to pick the best model for segmentation using only image-level labels, an important proxy when there are few or no segmentation labels available.
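The idea of classification via segmentation can be illustrated with a toy sketch. It assumes, purely for illustration, that the network emits one real-valued cropland score per pixel, that the image-level score is the mean of those scores (global average pooling), and that a zero threshold separates the classes; the function names and the threshold are assumptions and do not describe the exact U-CAM architecture.

```python
def image_score_from_pixel_scores(pixel_scores):
    """Global average pooling: the image score is the mean per-pixel score.

    `pixel_scores` is a list of rows of floats (positive = cropland).
    Hypothetical helper for illustration.
    """
    flat = [score for row in pixel_scores for score in row]
    if not flat:
        raise ValueError("pixel_scores must not be empty")
    return sum(flat) / len(flat)


def segmentation_from_pixel_scores(pixel_scores, threshold=0.0):
    """Threshold the same per-pixel scores to obtain a binary cropland mask."""
    return [[1 if score > threshold else 0 for score in row]
            for row in pixel_scores]


# Toy 2x2 activation map: three cropland-like pixels, one non-cropland pixel.
scores = [[2.0, 1.0],
          [-1.0, 2.0]]
is_cropland_image = image_score_from_pixel_scores(scores) > 0.0
mask = segmentation_from_pixel_scores(scores)
```

In this sketch the image is labeled cropland only because most pixel scores are positive, which is why a model trained on image labels alone can still yield a usable pixel-level map.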

Trade-Offs between Label Types
In light of our findings, researchers obtaining ground truth labels for segmentation de novo have their choice of dense labels (e.g., geospatially referenced polygons, densely segmented rasters), point labels (e.g., geospatially referenced points, pixel labels), or image labels. Our results shed some light on the trade-offs involved in choosing the label type. Figure 6 shows that the fully supervised U-Net performs segmentation well at extremely small dataset sizes; given only ten segmented training samples, the U-Net segments cropland at 84% accuracy. In comparison, the masked U-Net achieves a similar mean accuracy and variance after seeing between 100 and 200 pixel labels. This ratio of 10:1 to 20:1 single pixel labels to densely segmented labels holds across the curves in Figure 6, and suggests that pixel labels are preferable to segmented labels if a segmented label is more than 10-20 times as costly to obtain as a pixel label. Here cost should take into account not only compensation for crowdsourced workers or researchers, but also the difficulty of the labeling task, the complexity of designing the annotation instructions for training, and the likelihood that labels will meet the quality standard.
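The break-even reasoning behind the 10:1 to 20:1 ratio can be written as a small calculation: if roughly 10 to 20 point labels are needed to match one densely segmented label, point labels are the cheaper route only when 10 to 20 of them together cost less than one dense label. The sketch below is illustrative; the function name and the cost units are assumptions, and the costs should fold in the non-monetary factors listed above.

```python
def cheaper_label_type(cost_per_point_label, cost_per_dense_label,
                       points_per_dense=(10, 20)):
    """Compare the cost of matching one dense label with point labels.

    `points_per_dense` is the (low, high) range of point labels needed to
    match one dense label, taken from the 10:1 to 20:1 ratio in the text.
    Returns "point", "dense", or "unclear" when the answer depends on where
    in that range the true ratio falls. Hypothetical helper.
    """
    low, high = points_per_dense
    if cost_per_point_label * high < cost_per_dense_label:
        return "point"    # cheaper even at the pessimistic 20:1 ratio
    if cost_per_point_label * low > cost_per_dense_label:
        return "dense"    # more expensive even at the optimistic 10:1 ratio
    return "unclear"
```

For example, if a dense label costs 30 units and a point label costs 1 unit, point labels are preferable; at 5 units per dense label, dense labels are preferable.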
With the U-CAM model, a similar equivalence of 10:1 to 20:1 between image labels and densely segmented labels exists, until the performance of the U-CAM model flattens out after 500 image labels at 0.87. Additional labels beyond 500 do not help the model better localize the precise location of cropland pixels. Therefore, to achieve the highest segmentation accuracies, image labels may need to be augmented with pixel labels or densely segmented labels. One can imagine pre-training on a large number of image labels and fine-tuning on a small number of segmented labels. More research is needed to improve the localization of segmentation predictions transferred from image classification.

Method Limitations and Future Directions
By training U-Nets on an annual, median composite of the first seven Landsat bands, our work does not leverage the temporal nature of satellite imagery or commonly-used vegetation indices (VIs) like NDVI. Since the timing of plant growth and senescence helps distinguish different types of vegetation, features that capture variation in time should be an improvement over the annual median. Future work can explore segmentation using weakly supervised CNNs or RNNs with temporal features, especially in ways that are label-efficient. As for the use of vegetation indices, nonlinear methods like neural networks and random forests should in theory be able to recover them if given enough data, though adding VIs may still improve performance, especially at small training set sizes. The goal of this study is not to create the best possible cropland map, but to demonstrate that CNNs can perform segmentation of remote sensing imagery with weak labels, which have traditionally been used only to train pixel-based machine learning methods.
While we have shown that deep learning methods can achieve state-of-the-art accuracies on segmentation using weak labels, the application of CNNs to remote sensing tasks still contains trade-offs relative to more established machine learning methods (i.e., our baselines). First, training CNNs on remote sensing datasets and applying them at a large scale currently requires the user to move large quantities of data between GIS platforms (in our case, Google Earth Engine) and deep learning frameworks (TensorFlow, PyTorch, etc.). Further integration of these platforms will alleviate the manual manipulation of geospatial data and go a long way toward enabling the application of deep neural networks in this domain.
Second, deep learning models still suffer from a shortage of tools that enable users to qualitatively understand the relationship between inputs and the model's prediction. In applications where machine predictions feed into human decision-making, this lack of interpretability decreases trust in neural networks and may make them less suitable than highly interpretable models like logistic regression. More visualization tools and theory are needed to improve the transparency of deep learning; in the meantime, performance and interpretability should continue to be viewed as a trade-off when selecting between machine learning algorithms.

Conclusions
In this paper, we showed that the U-Net model, designed for end-to-end segmentation, can segment cropland in Landsat composite imagery over the US Midwest using small quantities of weakly supervising labels. The masked U-Net, trained on pixel labels, and the U-CAM model, trained on image labels, achieve segmentation accuracies exceeding 85% on training set sizes in the hundreds of labels. They outperform traditional machine learning baselines trained on the same quantities of labels (above n = 100), and show greater spatial coherence in their predictions.
Our work demonstrates that CNNs can be trained to perform accurate segmentation with weak supervision, using ground truth labels that contain less information per label than densely segmented ones but are easier to obtain in large quantities. This enlarges the possibilities of methods that can be used with existing point or image labels, plus future such datasets generated from fieldwork or crowdsourcing. Further work is needed to bridge the gap between the data requirements of state-of-the-art machine learning methods and the data availability in many remote sensing applications, and to integrate GIS data platforms with deep learning frameworks in order to apply these methods at large scale.

Acknowledgments:
We thank Nick Guo for providing technical support on the Google Cloud Platform and data storage.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Since satellite readings and cropland labels are highly correlated in space, metrics like accuracy will be inflated if individual samples are split into training, validation, and test sets at random. We therefore split the study region into 64 rectangles geographically, and randomly assigned 50 rectangles to a training and validation set and 14 rectangles to a test set. Within the training and validation set, the 50 rectangles were split into 10 folds of 5 rectangles each. Nine of the ten geographic training, validation, and test set splits are shown in Figure A1.

Figure A1. Nine of the ten folds showing training, validation, and test set splits. Folds are defined geographically to reduce performance inflation due to spatial autocorrelation. The first fold is shown in Figure 2b.
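The geographic split described above (64 rectangles, 14 held out for testing, and the remaining 50 divided into 10 folds of 5) can be sketched as follows. This is a minimal illustration: the function names, the integer rectangle IDs, the fixed seed, and the use of one fold at a time for validation are assumptions rather than details confirmed by the text.

```python
import random


def geographic_folds(n_rectangles=64, n_test=14, n_folds=10, seed=0):
    """Randomly assign rectangle IDs to a test set and to equal-sized folds.

    Returns (test_ids, folds), where `folds` is a list of lists of IDs.
    Hypothetical helper; rectangles are identified by integers 0..63.
    """
    rng = random.Random(seed)
    rect_ids = list(range(n_rectangles))
    rng.shuffle(rect_ids)

    test_ids = rect_ids[:n_test]
    trainval_ids = rect_ids[n_test:]
    if len(trainval_ids) % n_folds != 0:
        raise ValueError("training/validation rectangles must divide evenly")

    fold_size = len(trainval_ids) // n_folds   # 50 / 10 = 5 rectangles
    folds = [trainval_ids[i * fold_size:(i + 1) * fold_size]
             for i in range(n_folds)]
    return test_ids, folds


def train_val_rectangles(folds, val_fold):
    """Assumed usage: hold out one fold for validation, train on the rest."""
    val_ids = list(folds[val_fold])
    train_ids = [r for i, fold in enumerate(folds) if i != val_fold
                 for r in fold]
    return train_ids, val_ids
```

Every sample then inherits the split of the rectangle it falls in, so neighboring pixels never straddle the training and test sets.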

Appendix A.2. Baseline Model Descriptions and Implementation Details
• Logistic regression. Logistic regression is a commonly used classification method that uses a logistic function to model a binary outcome. The predictors are assumed to have a linear relationship with the logarithm of the outcome odds. Since it can only learn linear decision boundaries between classes, logistic regression performs poorly when class boundaries are highly non-linear but well when they are approximately linear, and can outperform non-linear methods when predictor dimensionality is high relative to number of data points.
• Support-vector machine (SVM). SVMs are a class of models capable of performing non-linear classification. They do this by constructing hyperplanes that separate the training set into classes in high or infinite dimensional space with the largest margins possible. They have been used with success in remote sensing to perform land cover classification and crop mapping [49].
• Random forest. Random forests are an ensemble machine learning method in which many decision trees are aggregated to perform classification or regression [50]. They are used frequently in the field of remote sensing to perform land cover classification and crop mapping [51,52], and have been shown to yield higher accuracies than maximum likelihood classifiers, support vector machines, and other methods for crop mapping [49,53,54].
We used Python's scikit-learn [55] implementation of logistic regression with penalty, support vector machines, and random forest classifiers. At each dataset size, we performed a 10-fold cross-validation to find the best hyperparameters. The hyperparameters that yielded the highest mean validation accuracy for each method and dataset size are shown in Tables A1-A3.
All baseline models were run on a Google Compute Engine virtual machine with 4 Intel Broadwell vCPUs and 52GB RAM, running Ubuntu 16.04. We used Python 3.7.3 and scikit-learn 0.21.3.

Appendix A.3. U-Net Implementation and Hyperparameter Search Details
Neural networks were trained on the same Google Compute VM as the baselines, using Nvidia K80 GPUs. The densely supervised and masked U-Nets were implemented in PyTorch 1.2.0, and the U-CAM model was implemented in TensorFlow 1.4.1. We used an Adam optimizer with learning rate $10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and batch size of 32. Batch normalization was used after each convolutional layer with batch norm momentum of 0.9. The U-Nets were trained for 20 epochs to allow convergence on training set sizes n ≥ 1000, and trained for 200 epochs at n = 10, 100 epochs at n = 20 and n = 50, and 50 epochs at 100 ≤ n < 1000.
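The training settings above can be collected into a small configuration sketch. The constant and function names are hypothetical, and the treatment of training set sizes not listed in the text (for example, values between 50 and 100) is an assumption made only so that the function is defined for every n.

```python
# Optimizer settings as stated in the text; the dictionary layout is illustrative.
ADAM_SETTINGS = {"learning_rate": 1e-3, "beta_1": 0.9, "beta_2": 0.999}
BATCH_SIZE = 32
BATCH_NORM_MOMENTUM = 0.9


def epochs_for_training_size(n):
    """Return the number of training epochs for a training set of size n.

    Stated in the text: 200 epochs at n = 10; 100 at n = 20 and n = 50;
    50 for 100 <= n < 1000; 20 for n >= 1000. Sizes in between the listed
    small values are assigned to the nearest smaller bracket (assumption).
    """
    if n <= 0:
        raise ValueError("training set size must be positive")
    if n >= 1000:
        return 20
    if n >= 100:
        return 50
    if n >= 20:
        return 100
    return 200
```

Smaller training sets get more epochs because each epoch contains far fewer gradient steps at a fixed batch size.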
To curb overfitting, we added an L2 regularization term to our cross entropy losses so that our final training loss was

$$J(\theta, \lambda) = L(\theta) + \lambda \lVert \theta \rVert_2^2 \tag{A1}$$

Note that only model weights were penalized; biases were not. Optimal U-Net hyperparameters were found via grid search and are shown in Table A4. The hyperparameters we searched over are the number of encoding and decoding blocks $l$, the number of filters $f$, and regularization strength $\lambda$.

Table A4. Hyperparameters for U-Net yielding the highest validation accuracy across dataset sizes.
We observed that model depth and width did not significantly affect end-to-end segmentation, suggesting that the task of mapping cropland using a Landsat composite is simple enough to be performed well with the smallest of these models (3 blocks, 16 filters in the first block). For transferring image classification to segmentation, however, deeper 4 or 5 block U-Nets with 32 to 64 starting filters achieved the highest segmentation accuracy. Optimal $L_2$ regularization strength varied from $10^{-4}$ to $10^{-3}$ depending on training size.
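The regularized loss in Equation (A1), with the penalty applied to weights but not biases, can be sketched in a framework-independent way. The function name, the dictionary of flattened parameters, and the convention that bias parameters have names ending in "bias" are assumptions for illustration only.

```python
def regularized_loss(data_loss, parameters, lam):
    """Compute J = L + lam * ||theta||_2^2, penalizing weights only.

    `data_loss` is the cross entropy value L(theta); `parameters` maps
    parameter names to flat lists of floats. Entries whose name ends in
    "bias" are excluded from the penalty (naming convention is assumed).
    """
    penalty = sum(value * value
                  for name, values in parameters.items()
                  if not name.endswith("bias")
                  for value in values)
    return data_loss + lam * penalty


# Toy example with one weight tensor and one bias tensor.
params = {"conv1.weight": [1.0, -2.0], "conv1.bias": [10.0]}
loss = regularized_loss(0.5, params, lam=1e-3)   # 0.5 + 0.001 * (1 + 4)
```

With regularization strengths in the reported range of $10^{-4}$ to $10^{-3}$, the penalty stays small relative to the data loss unless the weights grow large.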

Appendix A.4. Random vs. Deterministic Masking
Randomness in the position of the labeled pixel was important for achieving high correlation between validation task accuracy (single pixel classification) and segmentation accuracy. Figure A2 shows the correlation between validation set task accuracy and segmentation accuracy for the two types of labels. While random label position achieves an $R^2$ close to 1.0, a label position that is always in the center of the tile achieves $R^2 = 0.787$. A lower $R^2$ means there is less of a guarantee that a model that classifies the center pixel correctly also segments an entire tile correctly, making it more difficult to select a good model during cross-validation.
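The coefficient of determination between task accuracy and segmentation accuracy can be computed from a simple least squares fit, as sketched below. The function name and the toy accuracy values are illustrative assumptions, not data from this study.

```python
def r_squared(x, y):
    """R^2 of the ordinary least squares line fitted to (x, y) pairs.

    `x` could hold validation task accuracies and `y` the matching
    segmentation accuracies, one pair per trained model. Hypothetical helper.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must both vary")

    # Fit y = intercept + slope * x by least squares.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / ss_tot
```

A value near 1.0 means that ranking models by task accuracy ranks them almost identically by segmentation accuracy, which is what makes model selection on weak labels reliable.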
Figure A2. Scatter plots and corresponding least squares fit between validation set task accuracy and segmentation accuracy for (a) randomly located pixel labels and (b) pixel labels always at the center of the tile for the masked U-Net model. Points are shown for training set sizes of n ∈ {100, 1000, 10,000} across 10 runs of [50, 20, 20] epochs, respectively.

Appendix A.5. Baseline Model Results
Figure 6 summarizes the performance of baseline and oracle methods across a wide range of training set sizes. Hyperparameters of the baseline models were tuned for each training set size and are listed in Tables A1-A3. Of the three pixel-based baseline methods (Figure 6a), SVM achieved the highest classification accuracies consistently across different values of n (85.4% mean accuracy at n = 1000), while random forest accuracies were close behind (84.0% mean accuracy at n = 1000). Accuracies for both methods increase with training set size up to the largest size of $n = 10^5$, though the increase slows at larger n. In contrast, logistic regression performs significantly worse than the nonlinear baselines and reaches its highest accuracy of 81% by n = 2000. The downside of SVM is that its $O(n^2)$ computational complexity makes it prohibitively time-consuming to train, so we do not report SVM accuracies at $n > 10^4$.
Due to high SVM runtime, we only evaluated logistic regression and random forest baselines for image-level labels, where 1 image label corresponds to 2500 pixels. Memory constraints also limited these two methods to training set sizes of up to 20,000 images (50 million pixels) and 2000 images (5 million pixels) respectively. Figure 6b shows their performance as the number of image labels increases. While random forest accuracies are on average worse than logistic regression when the training size is very small, the forest begins to capture nonlinearities and surpass logistic regression when shown 100 or more images.
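The way image labels are turned into training data for the pixel-based baselines (every pixel in a tile inherits the tile's image label) can be sketched as follows. The function name and the nested-list tile layout are assumptions for illustration; a 50 x 50 tile is assumed only because it is one shape that yields the 2500 pixels per image mentioned above.

```python
def pixels_with_image_label(tile, image_label):
    """Flatten a tile into (features, label) samples using the image label.

    `tile` is a list of rows; each row is a list of per-pixel feature lists
    (for example, one value per spectral band). Every pixel receives the
    same `image_label`, which is the source of the label noise discussed in
    the text. Hypothetical helper.
    """
    return [(list(pixel), image_label) for row in tile for pixel in row]


# Toy tile: 50 x 50 pixels with 7 band values each, labeled as cropland (1).
toy_tile = [[[0.0] * 7 for _ in range(50)] for _ in range(50)]
samples = pixels_with_image_label(toy_tile, 1)
```

Since one image label expands to thousands of pixel samples, memory use grows quickly with the number of images, consistent with the limits on training set size reported above.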