Extrapolating Satellite-Based Flood Masks by One-Class Classification—A Test Case in Houston

Flood masks are among the most common remote sensing products, used for rapid crisis information and as input for hydraulic and impact models. Despite the high relevance of such products, vegetated and urban areas are still unreliably mapped and are sometimes even excluded from analysis. The information content of synthetic aperture radar (SAR) images is limited in these areas due to the side-looking imaging geometry of radar sensors and complex interactions of the microwave signal with trees and urban structures. Classification from SAR data can only be optimized to reduce false positives, but cannot avoid false negatives in areas that are essentially unobservable to the sensor, for example, due to radar shadows, layover, speckle and other effects. We therefore propose to treat satellite-based flood masks as intermediate products with true positives, and unlabeled cells instead of negatives. This corresponds to the input of a positive-unlabeled (PU) learning one-class classifier (OCC). Assuming that flood extent is at least partially explainable by topography, we present a novel procedure to estimate the true extent of the flood, given the initial mask, by using the satellite-based products as input to a PU OCC algorithm learned on topographic features. Additional rainfall data and distance to buildings had only minor effect on the models in our experiments. All three of the tested initial flood masks were considerably improved by the presented procedure, with obtainable increases in the overall κ score ranging from 0.2 for a high quality initial mask to 0.7 in the best case for a standard emergency response product. An assessment of κ for vegetated and urban areas separately shows that the performance in urban areas is still better when learning from a high quality initial mask.


Introduction
Satellite-based flood mapping is a central topic in applied remote sensing, due to the high relevance of accurate event maps in all phases of the disaster risk management cycle. Besides the use during emergency response, the observed flood extent is often necessary for post-event analysis, including modelling studies. An emerging field is also the assimilation of flood extents in near-real-time into hydrodynamic models [1]. The term flood mask refers to a binary geospatial data layer of flood water extent, where the permanent water bodies are excluded. Most products are currently based on synthetic aperture radar (SAR) sensors, which can operate day and night, independent of cloud cover. As the temporal coverage and free-of-charge availability of satellite imagery steadily increases, flood masks of varying quality and file format are produced, for example, by the Copernicus Emergency Management Service of the European Commission (EMSR). However, there are still obvious validity layer. This leads us to formulate our research question as a one-class classification (OCC) problem. OCC algorithms require only one class to be labeled, termed the positive (P) class. They may use either only P or PU training data, thereby avoiding to wrongly treat unknown labels as true negatives. Such methods are commonly used in habitat modelling [36] as well as for specific remote sensing questions like mapping raised bogs [37], invasive tree species [38], Bark Beetle infestation [39] or damaged maize fields [40]. Mack & Waske [41] investigated the discriminative power of the well-known PU algorithms MaxEnt [36] and Biased Support Vector Machine (BSVM, [42]) in comparison to a P classifier and a PN benchmark model for a variety of classification tasks. PU learning is generally considered more promising than P learning, especially when classes are not perfectly separable, because PU algorithms may learn about the overall distributional characteristics. When using an OCC on satellite-based flood masks, there is no need for a validity layer, as long as false positives have been minimized during the creation of the flood mask (depending on the algorithm, some violation of this assumption is acceptable).
The aim of this study is to improve satellite-based flood masks by reducing false negatives in areas where the satellite sensor has low sensitivity, such as vegetated and urban areas. Our investigation requires a flood event covered by multiple satellite-based flood masks of different quality, relatively high resolution topography, gridded rainfall measurements, and mapped building footprints. Additionally, we use high-quality flood extent maps ("ground truth") for testing the performance of the proposed approach. We chose the well-documented event of 2017 hurricane Harvey in Houston, TX, as test case. We present a novel methodology for extrapolation by OCC and test it with three different initial satellite-based masks on different spatial scales. The paper is organized as follows: Section 2 gives a description of the flood event and used datasets, followed by details on the algorithms, performance metrics, and experimental setup. In Section 3, the skill of the BSVM and MaxEnt models is compared, and the effect of a region-growing postprocessing is quantified. Example maps of spatial predictions are shown for selected models. The results are then discussed in a broader context in Section 4.

Study Area and Datasets
Hurricane Harvey ranks among the costliest disasters that have affected the United States during the last decades [43], with Houston in particular suffering severe damage in the final days of August 2017. Although considered primarily a pluvial flood event, with implications for modelling [44], the vast spatial extent and long duration of the rainfall also caused all major river basins to overflow. According to the Harris County Flood Control District (HCFCD), 70,370 out of 154,170 flooded homes were located beyond the official 500-year flood hazard zone [45]. Water levels in the San Jacinto River exceeded all historical records, with estimated return periods above 500 years in many places. In the western part of Houston, two large-scale flood control structures, the Barker reservoir and the Addicks reservoir ( Figure 1) were forced to open their release gates on 28 August, but the water level within continued to rise until August 30 to the point of local overtopping, despite the open gates [46]. The combined outflow of both reservoirs led to a massive flooding of the Buffalo Bayou. It is reported that about 14,000 homes were even located within the reservoirs themselves.

Flood Masks for Training and Validation
The following products were used in our study as initial flood masks for training the OCC models: The EMSR released a mapping of areas inundated by Hurricane Harvey on 31 August 2017 (EMSR_229), based on Cosmo-SkyMed data. This is a typical standard product, designed for rapid response. The EMSR_229 mask covers the entire urban area of Houston and surroundings. Li et al. [20] further classified parts of a Sentinel-1 scene from August 30th, including interferometric coherence with previous scenes, by a Bayesian Network fusion technique (DLR_BN). Li et al. [21] also processed TerraSAR-X images by a convolutional neural network (DLR_CNN), with the flooded scene dating to September 1st. The latter is only available for a rather small region within the city, along the Buffalo Bayou. Both DLR_BN and DLR_CNN can be regarded as "high quality" masks, with reported κ coefficients of 0.68 in both cases from comparison to a labeled aerial image. However, we observed some flaws in this labeling when comparing it to the raw aerial image. Validation in our study is based on two independent products: First, we downloaded the original 50 cm resolution aerial image acquired by the National Oceanic and Atmospheric Administration (NOAA) on 30 August 2017, accessed on 3 December 2020. (https://storms.ngs.noaa.gov/storms/harvey/index.html#9/29.8430/-95.0729) and manually labeled all flooded areas on the image (NOAA_labeled) in three categories: open flood water, flooded vegetation, and flooded urban area. The land cover classes allow for calculating the model skill in a stratified manner, providing numbers for vegetated and urban areas separately. The guiding principle for assigning these land cover classes was to consider what is visible from the point of view of a satellite. Small patches of open water within built-up environment were still labeled "urban", as the SAR signal in these locations would most likely be influenced by the surrounding buildings. The main channel of the Buffalo Bayou was labeled "vegetation", as there are mainly tree canopies visible from a satellite's perspective. Great care was taken to only include buildings in this reference map where it was obvious, for example, from the color of the swimming pools, that at least the ground floor of the building got affected-otherwise we only delineated the visible water on the roads. Permanent lakes within the urban areas were intentionally not mapped, only the flood waters surrounding the regular lake extents. While some residual ambiguity remained between the assigned land cover classes, especially between open water and flooded vegetation inside the large reservoirs, we are confident that this manually labeled image is a very precise reference for the situation on 30 August 2017. This reference map is publicly available as online supplement to this publication. Secondly, we obtained a mapping by the United States Geological Survey (USGS) for the San Jacinto River (USGS_SJ). The USGS has released flood extents for major river catchments [47], based on interpolated field measurements of high water marks (HWM), which have been used by the company Fathom [44] as "ground-truth" for validating their hydraulic model simulation of the event. Watson et al. [47] acknowledge that some uncertainties remain in areas where the coverage of the HWM is sparse and that the mapped boundary was manually extended to anthropogenic structures such as roads or bridges. We overlayed all masks with OpenStreetMap (OSM) water layer (http://hydro.iis.u-tokyo.ac.jp/~yamadai/OSM_water/, accessed on 20 July 2020), which includes categorized water bodies in high spatial detail, and removed all of these areas from the masks, thereby equally converting all masks to flood masks.

Explanatory Features
An overview of used datasets and features is given in Tables 1 and 2. A digital elevation model (DEM) called the National Elevation Dataset (NED) is available from the USGS, based on the best available data source per area [48]. We used the 1/3 arc seconds (~10 m) version. From the DEM, different features have been derived: slope, curvature, topographic wetness index (TWI) and topographic position index (TPI). The TPI is a geomorphological measure derived by focal window operations, which in machine learning terminology can be considered a manual convolution on the DEM. TPI has a clear physical meaning, as it indicates local hills and depressions. Combining TPI on multiple scales allows for identifying more complex landscape morphologies [49]. We used the implementation in the R library spatialEco [50] and computed TPI on the scale of 11, 51, and 101 cells, which corresponds to about 50, 250, and 500 m in all directions. The OSM water layer distinguishes 5 types of water bodies in this area, namely "Ocean", "Large Lake & River", "Major River", "Small Stream" and "Canal". We discarded the ocean and merged "Small Stream" and "Canal" as these labels appeared to have been used interchangeably from visual inspection in the Houston area. This left us with three different stream layers, for which we computed the HAND and Euclidean distance separately (by GRASS r.watershed and GDAL Proximity). OSM buildings had very limited coverage in Houston at the time of this study, therefore we used Microsoft USBuildingFootprints (https://github.com/microsoft/USBuildingFootprints, accessed on 26 August 2020). The Euclidean distance was computed on rasterized shapes, which corresponds to the distance to the closest building cell. Gridded rainfall data was downloaded from the US National Weather Service (NWS) website (https://water.weather.gov/precip/download.php, accessed on 25 August 2020). We used the sum of 26-30 August, where most of the rainfall occurred in Houston. The accumulated rainfall was computed via the GRASS GIS tool r.accumulate, with the rainfall sum as input. Features were separated into three groups for our experiments. Most features were derived from the DEM and/or stream location data, and therefore called "Topo" features. These were always used. The "Rain" and "Buildings" features were added separately to test the effect of the additional data. To keep it simple during processing, we resampled all datasets to the resolution of the DEM, so that all layers could be converted to a raster stack.  Two commonly used PU learning algorithms are tested in this study for the purpose of extrapolating satellite-based flood masks from the abovementioned features. BSVM [42] is a discriminative algorithm, originally developed for text classification. It was found superior to previous multi-step OCC procedures, and also to other P and PU learners, for classification of remote sensing images [41]. Essentially, it is a support vector machine with radial basis function (RBF) kernel and unequal misclassification penalty terms in the cost function. By assigning higher penalty to misclassified positive samples, the unlabeled samples are considered "negotiable" during training. The biased cost function is given as Equation (1) Minimize where C + and C − are the cost of misclassification for positive and unlabeled samples, respectively. C + is in practice parametrized by C Multiplier times C − . w is the weight vector, x is the feature vector, and y is the corresponding label. ξ is the slack variable used to evaluate potential hyperplanes for k-1 positive and k-n unlabeled samples. Superscript T denotes the inner product. The so constructed hyperplane has by definition a value of 0, which can be regarded as the "default" threshold (θ De f ault ) for binary classification of BSVM. The continuous output of BSVM gives the distance to this hyperplane, where higher values indicate samples associated with the positive class (i.e., flood in our case), and lower values indicate samples associated with the negative class. However, a threshold for binary classification can be set by the user at any value, and it is sometimes recommended in the literature not to rely on the default in application, for example [41]. For the PN benchmark models, we used a regular (unbiased) support vector machine (SVM), in which the misclassification costs for both classes are equal.
MaxEnt [36] is a generative algorithm with solid roots in information theory and probabilistic reasoning [51]. The implementation in a stand-alone software, which is now open source [52], is commonly used in ecological modelling as well as for mapping rare land cover classes. The developers phrase the objective of the maximum entropy principle as estimating a distribution that agrees with everything that is known, and at the same time avoiding any assumptions about what is unknown. More specifically, the procedure searches for a Gibbs distribution, under the constraints that the expectation of every feature corresponds roughly to the empirical feature mean, while pertaining a shape as close to the prior distribution as possible. MaxEnt internally computes variance features, product features, threshold features, and hinge features. This allows the algorithm to learn complex responses and interactions, but requires regularization to avoid overfitting. The optimal value of the regularization parameter β is accordingly determined over a grid search. Note that the original formulation by [36] is in geographic space, and in that space the prior distribution is a uniform distribution, that is, all locations are a-priori equally likely to contain the positive class. More in line with machine learning literature is the formulation in feature space, where the prior is the marginal feature distribution, and MaxEnt estimates the distribution of the positive class by minimizing the relative entropy (Kullback-Leibler divergence) between the positive and marginal distributions under the constraints imposed by the feature means [53]. The formulation is unconditional, so that only positive and unlabeled data is required. In other words, MaxEnt models the ratio of presence to background, which results in a relative probability. The cost function Equation (2), in the notation of the authors, can be shown to be the negative log-likelihood with an L1 penalty term.
whereπ is the prior distribution,π the resulting MaxEnt distribution, q λ the Gibbs distribution, square brackets [] denote the expectation, ln the natural logarithm, β is the cost parameter, and λ the weights, over j features f.
The result of MaxEnt is a relative occurrence rate, sometimes termed "suitability", which can be obtained in different transformed (monotonically related) output formats. Similar to [39], we use a value 0.5 on the so-called logistic output format as "default" threshold for MaxEnt, because this is the default value for the internal parameter used to create the logistic output [53]-despite strong arguments in the literature stating that this output format should not be carelessly treated as absolute probability of presence [54,55]. This theoretical issue is not of interest to us here, since we do not apply any probabilistic interpretation. For further mathematical details, the interested reader is referred to the abovementioned original literature. In this study, we relied on the R library oneClass (https://github.com/benmack/oneClass, last access on 05 October 2020), which contains a BSVM implementation, as well as an R wrapper of MaxEnt that calls the Java source file. Both implementations internally scale the data.

Post Processing by Region Growing
To restrict the predicted flood extent to those areas that have a spatial connection to the initial extent, we applied the ConnectedThreshold method from the Python module SimpleITK [56]. The procedure starts at given seed points and checks whether neighboring raster cell values fall within or outside a user-defined range. As seed, the original SARderived flood extents were used. If a cell is discarded, its neighbors are not considered and the propagation in that direction stops. When providing a binary raster (dry or flooded, denoted as 0 or 1) and setting the user-defined threshold to 1, then the result is simply a cut-back binary raster, on which all flood cells unconnected to the initial flood extent are reset to non-flooded.

Performance Metrics
Two different types of metrics are needed for this study: training metrics based on PU data to select the best model during the parameter grid search, and validation metrics based on the PN reference to evaluate the final extrapolations. With only positive and unlabeled data, the quantities that can reliably be estimated are the True Positives (TP, prediction and observation are positive), the False Negatives (FN, prediction misses positive observation), and the model's probability of positive predictions among all predictions. From these quantities, various metrics have been proposed in the literature (see e.g., [57,58]). However, most of these metrics are depending on the binarization threshold. For thresholdindependent evaluation of binary classifiers, it is common to compute the area under the curve (AUC) of the receiver-operator characteristic (ROC) [59,60]. The AUC indicates how well the algorithm ranks the instances. For PU data, the best obtainable AUC value is theoretically lower than 1, as some unlabeled samples should get ranked among the positive class, but Phillips et al. [36] have claimed that the difference in AUC PU is still a valid measure to compare the discriminative power of multiple models. In line with Phillips et al., we argue that AUC PU is a consistent metric for model selection, as it has the same meaning for any algorithm (BSVM has a different default threshold than MaxEnt), and is adequate for any purpose. The user can later decide to put more emphasis on sensitivity or specificity during threshold selection, depending on the intended application of the model. We verified that AUC PU indeed correlates with AUC PN , which denotes the same metric based on PN reference data ( Figure 2). While even high PU performance is no guarantee for high PN performance, and the very best model on test set might not be the rank #1 on training set, AUC PU generally selects good models, which makes it a reasonable choice in the absence of PN test data. This behavior has been previously reported by [58], who suggest a manual inspection of several candidate models. However, as we present a method rather than a specific classification, manually inspecting several candidate models for each experimental setup was deemed unfeasible and too subjective for a methodological study. It is worth to note that we have conducted similar checks with other PU metrics in the early stage of this study, but only present AUC PU here, due to the abovementioned consistency of this metric. Validation metrics for PN data are more standard. We measure the commonly used κ score by Cohen [61] as well as the sensitivity (true positive rate, Equation (3)), specificity (true negative rate, Equation (4)), and error bias (EB, Equation (5)). To evaluate the initial masks, we further provide the percentages of detected open water, flooded vegetation, and flooded urban areas. The PN performance is given for the entire images, that is, all pixels, stratified by the manually assigned land cover class.

Experimental Setup
The presented extrapolation procedure by OCC, as visualized in Figure 3, works in four steps plus validation: (1) feature engineering, by which we mean the derivation of explanatory variables (e.g., topographic indicators) from the raw data (e.g., DEM); (2) Training data sampling; (3) model learning; and (4)  This results in a single raster with continuous values, which represent the raw output of the algorithms (i.e., distance to the hyperplane for BSVM, and relative probability for MaxEnt) for each raster cell. To obtain a binary prediction (flooded or not), a threshold has to be applied to this continuous prediction. Subsequent region-growing removes areas without connection to the initial mask, which makes the result appear like an inter-/extrapolation. The binary predictions, raw and postprocessed, are then validated by comparison to the independent reference maps NOAA_labeled and USGS_SJ. The difference between the binary predictions and corresponding binary reference results in a validation map with the 4 classes TP, FP, TN and FN. Samples for training the models were drawn from the valid extent of the respective initial mask, that is, AOI1 for DLR_BN and NOAA_labeled, AOI2 for DLR_CNN, and AOI3 for the USGS_SJ benchmark. In the case of EMSR_229, which has by far the largest extent, it was tested how the sampling area during training affects the skill. Eventually we used the entire area covered by the feature stack as training area ("Full Extent") for the presented results.
OCC methods have been applied to problems with very few positive training samples, because these occur rarely or are expensive to obtain. In our case, obtaining positive samples does not constitute a problem since we can potentially use the entire flood extent as training area. The number of unlabeled samples should be high enough so that the feature space during training is representative for the feature space in the application case, that is, more is better, limited only by concerns about computation time [58]. For each PU classification problem, we randomly sampled (without replacement) 2000 positive and 8000 unlabeled pixels. The PN benchmark models were trained with 5000 positive and negative samples each. Further, we tested two sampling modes, named "regular" and "urban". In regular mode, samples were drawn entirely random. In urban mode, samples were drawn in equal parts from a distance up to 20 m, 100 m and above 100 m distance to buildings. The idea behind this urban sampling was to provide the algorithms with more of those samples which we consider to be difficult and of primary interest. DLR_CNN, like the manually labeled reference, contains distinct labels for flooded open water and flooded urban areas, so in that case for the urban mode we instead only used the urban class.
Models were further trained on four different feature subsets as denoted in Table 2, guided by the question of potential application. Both algorithms use regularization, so theoretically there is no need for manual feature selection. However, models including rainfall or distance to buildings require that additional data to be available, and might potentially learn different types of patterns. Therefore we investigated these choices separately. The subsets are: only topographic data and distance to streams ("Topo"), the aforementioned plus rainfall data ("Topo+Rain"), topographic data plus distance to buildings ("Topo+Buildings") and all data combined ("All").
For the sake of providing consistent numbers, two thresholds were considered for all models: the default (θ De f ault ), that is, 0 for BSVM and 0.5 for MaxEnt, which is learned from the PU training data, and the optimal threshold (θ Opt ) at maximum κ, which requires PN reference data. In practical application, the user would most likely inspect the continuous prediction of the best models (selected by AUC PU ), before deciding on the threshold. However, as we present a novel procedure here, we cannot inspect all models in detail and want to provide the maximum obtainable skill.

Skill of the Initial Masks
To evaluate whether the proposed procedure is able to improve the initial masks, we first quantified the quality of the original products by the same measures as used for the models and using the same reference data (Table 3). EMSR_229, despite detecting essentially no flooded vegetation or urban areas at all, still obtains a tolerable accuracy score, due to its outstanding specificity (0.999), that is, no false positives. The higher overdetection in the San Jacinto area might also hint at errors in the USGS_SJ reference. DLR_BN and DLR_CNN also exhibit 0.99 and 0.98 specificity, respectively, while detecting just 20%-40% of the flooded vegetation and urban areas. This clearly underlines our hypothesis, that these products should be regarded as positive and unlabeled. EB consequently ranges between 0.001 for EMSR_229 to 0.13 for DLR_CNN, indicating underdetection. Note that DLR_BN only achieves an overall κ score of 0.34 (0.51 in urban areas) on our manually labeled reference, as opposed to 0.68 on the inconsistently labeled reference used in the original study by [20]. It is still a high quality product, judged by the specific skill on urban areas. Table 3. Skill of the initial masks on the AOIs used in this study. Reference data for AOI1 and AOI2 is the manually labeled aerial image NOAA_labeled, reference for AOI3 is the USGS HWM interpolation USGS_SJ. The metrics EB, Sens., Spec., Acc., and κ are calculated over all landcover classes, while κ veg. and κ urban were derived using only the flooded vegetation and flooded urban areas, respectively.

Skill of the Extrapolation Models
A full list of model setups and the threshold-independent ranking performance AUC PN , as well as the training performance AUC PU , can be found in the Appendix A Table A1. The setup of our experiments (feature selection and sampling mode) apparently had only minor impact on the results. The only remarkable finding in this context is that the spatial transfer application of DLR_CNN models to the entire AOI1 gave much better results with distance to buildings included. The best EMSR_229 models are those trained on all features, and the urban sampling mode did slightly improve these models on the urban AOI2 (Buffalo Bayou)-however, the same cannot be stated for the other initial flood masks. The effect of feature selection on the benchmark models was also negligible. We interpret this as indication that the most important features are already included in the "Topo" selection. In the following, we therefore analyze the models from different setups together, as we consider them to rather show random variation than meaningful differences. This adds a rough estimation of variance to our results and helps to visualize the effect of algorithm selection, threshold selection and postprocessing more clearly.
The κ score on validation data over all land cover classes ( Figure 4) shows that all initial flood masks can be considerably improved by the presented approach, with differences to the best models ranging from about 0.2 (DLR_CNN) to 0.6 (EMSR_229 on AOI1). Learned models are clearly performing best in their respective area of training: the West Houston AOI for DLR_BN, and the Buffalo Bayou for DLR_CNN. In San Jacinto, the best models are those learned from EMSR_229, which is the only initial mask that is defined in all three AOIs. The skill obtained when extrapolating from the EMSR product is mediocre on the Buffalo Bayou, where no flood was initially detected, better in the San Jacinto basin, and surprisingly high in West Houston. Predictions of the other models in San Jacinto, and also the application on the entire West Houston AOI for models learned from DLR_CNN, are spatial transfer. It is unsurprising that performance is lower in these cases, and not aim of the paper to improve this spatial transfer performance. The overall skill of the best extrapolation from the EMSR_229 mask on AOI2 is similar to the original DLR_BN product, and on AOI1 even competitive with the models learned from DLR_BN and DLR_CNN-however, the improvements on AOI1 stem primarily from correct detection of flooded vegetation, while the specific skill on urban flooding is still relatively low. This can potentially be explained by the fact that AOI1 is dominated by forest, while AOI2 is almost exclusively urban area, therefore the models are optimized on different conditions. It is encouraging to see that all models learned from DLR_CNN further improve this high quality initial flood mask in urban areas. Differences in κ between the best PU and PN models account to 0.15 on AOI1, 0.16 on AOI2 and 0.38 on AOI3.
At first glance, both algorithms perform similarly well, with MaxEnt often showing larger variance, meaning it appears to be more sensitive towards setup than BSVM. One notable difference is the skill on urban areas: MaxEnt models learned from DLR_BN perform worse on urban areas than the initial mask. All MaxEnt models on AOI2 perform worse than their BSVM counterparts. At the same time, performance of MaxEnt models for flooded vegetation on AOI1 is higher. Both algorithms were trained with identical data, therefore the differences have to result from the model structure. It is reasonable to assume that topography in vegetated areas behaves differently than in urban areas. The training scores (Table A1) show that BSVM in general fits closer to the training data. The initial flood masks DLR_BN and DLR_CNN already cover significant areas of urban flooding, so the close fit could be one reason for the good performance on urban areas in these cases. However, the case of EMSR_229 is less clear. A remarkable difference was observed in robustness of the optimal classification threshold ( Figure 5). The optimal threshold value for BSVM varies considerably in our experiments. This behavior may be a drawback for use cases without reference data, and for integration into automatic processing chains. MaxEnt is slightly less affected by this problem. Keep in mind, though, that the continuous output of both algorithms has a different meaning and scale (unbounded distance to the hyperplane for BSVM, and probability between 0 and 1 for MaxEnt). The average loss of skill for the PU models is below 0.1, but in individual cases considerably higher. The suitability of the default threshold may dependent on the representativeness of the training samples: For the reference models, training and application data were drawn from the same underlying distribution, and in that case θ De f ault and θ Opt are closer, with the skill being almost identical (∆κ below 0.025).
Classification on pixel level may lead to noisy results and in some cases detect possible flood in areas that were not affected by the event in question. Postprocessing, as expected, increased the specificity in tradeoff for sensitivity, but overall κ was raised as well ( Figure 6). Beyond the intended effect, we also observed significantly reduced noise from the initial mask, because random errors are unlikely to occur in the same spot twice (meaning the satellite image classification and the classification from topography as presented in this paper), so that these areas are removed. Specificity of the best EMSR_229-derived extrapolations is again close to 1 after the postprocessing, meaning that the derived flood extent is reliable. Obviously, the region-growing, which checks for connectivity with the initial flood extent, only makes sense for those areas where the initial mask is defined, not for spatial transfer (DLR_BN and DLR_CNN to San Jacinto, DLR_CNN to aerial).  Overall effect of postprocessing on κ, sensitivity and specificity at θ Opt . The range of the boxplots includes both BSVM and MaxEnt models to visualize the general trend. Empty boxes indicate that postprocessing is not possible because the initial mask does not exist on that extent. Note that EMSR_229 is theoretically defined on AOI2, but there was no flood detected in that area, therefore the region-growing would remove all predictions there.

Spatial Comparison of Predicted Flood Extents
The large-scale comparison (Figure 7) visualizes the general behavior of an OCC model learned from the EMSR initial flood mask. The initial mask used for training (green) is primarily located outside the test areas. Some disagreement between the training mask and the validation mask is visible, especially in the west. The overestimation (yellow area) is explainable given the training data, which are learned as true extent. Note that the NOAA_labeled reference has been created by us, and we are accordingly confident about the quality, while the USGS_SJ mapping on the other hand is based on interpolated high water marks and could contain errors which we cannot further evaluate. Note also that the underestimation visible on the map (red) stems to large parts from the postprocessing, which removes predicted flood without spatial connection to the initial mask. This is especially obvious for the channel of the Buffalo Bayou, which is completely missing on the postprocessed version. The continuous prediction outside the validation areas shows that the model has indeed learned quite smooth and understandable patterns along the rivers. It is also obvious that the models correctly learned to exclude the permanent river channels.  Visually disturbing is the buffer around the channels that has been classified as non-flood, which is also seen on the initial mask. This is probably an artifact from training on flood masks instead of water masks. The area around these streams is covered by dense forest. Note that the previously shown EMSR_229 model performed better along these channels, presumably because it could learn the relationships of flooding along other streams, which are less obstructed by vegetation. The DLR_BN model performs much better on urban areas, though. There is underestimation visible along the Buffalo Bayou settlements, yet the affected urban areas in the north and south-west are captured quite well. These areas are colored mainly in yellow (overdetection) because the model did not restrict the predictions to the streets, which are visible as fine blue patterns, but the affected area seems reasonable. The overestimation along the western channel is not removed during postprocessing due to spatial connection with unluckily distributed noise on the initial mask. Even with the highest quality initial mask, DLR_CNN (Figure 9), water in the streets remains mostly undetected. Still, the extrapolation outside the training area, visible in dark colors, appears smooth and connected. Noise from the initial mask has been entirely eliminated. The land-water boundary appears quite sharp.

Aim and Overall Success
The assessed satellite-based flood masks exhibit very low (EMSR_229) to moderate (DLR_BN, DLR_CNN) detection skill in vegetated and urban areas. This is to be expected, due to the various effects which constrain the information content of SAR images in these cases, that is, volume scattering, layover, oblique viewing geometries, and others. The specificity of all these products is very high, though, meaning that those areas, which are identified as flooded, indeed represent true flood. We therefore propose to treat such products as PU data. Our study demonstrates how these satellite-based flood masks can then be improved in vegetated and urban areas by an OCC procedure. A critical point for such studies is the reliability of the reference data. We present a performance evaluation on a precisely labeled aerial image, which is of higher quality than what is frequently used in other studies (e.g., [62,63]). For the larger scale, we use the extent in the San Jacinto river as published by the USGS, which is based on interpolated HWM and has been used as reference by [44].
According to the performance metrics, all initial flood masks can be considerably improved by the presented procedure. For the EMSR product, κ over all classes rose from 0.06 to 0.76 in the best case with postprocessing, and from 0.00 to 0.25 in urban areas. The high quality initial masks, DLR_BN and DLR_CNN, have also been successfully enhanced up to about 0.2 points. Although the raw classification may at first lead to some overestimation far off the initial mask, the postprocessing improved the specificity as well as the visually perceived quality of the results, by suppressing uncorrelated errors of the initial SAR classification and our classification from topographic data. In a pluvial event, the formation of disconnected puddles is possible. The region-growing may delete such correctly predicted puddles from the classification. However, if the initial satellite mask contains a single pixel of that puddle, the area is kept. Whether or not to apply this postprocessing is therefore also a question of the quality of the initial mask. Overestimation after the postprocessing occurs mainly in places where the reference mask disagrees with the input mask, meaning either false positives in the input or false negatives in the reference data. For USGS_SJ, some uncertainty is to be expected. For NOAA_labeled, minor differences might be induced by the different acquisition times of the aerial and satellite images. The models were explicitly trained on flood masks. For many applications it might be more suitable to generate water masks, which include the permanent water. This should also solve the visible underdetection along the streams in the DLR_BN models. We refer to extrapolation as growing areas of flood detected on the initial datasets. Spatial transfer (e.g., DLR_BN to USGS_SJ) did not work well. A local approach is necessary, because event characteristics differ spatially. Although some extrapolation outside the extent of the initial mask is possible, the predictions far off the original extent, and especially in different river basins, are therefore deemed unreliable (note the difference here is between undetected flood on the original extent of the satellite image, and areas outside the satellite image). Whether the area in between two or more satellite images could be modelled by this approach has not been investigated, but could be an interesting question to try in future studies.

Features and Algorithms
Our analysis was based on features that have commonly been suggested in the literature for the purpose of flood susceptibility mapping [29][30][31][32]64], like HAND, TWI, distance to streams and descriptors of the local topographic situation. The skill of the PN benchmark models suggests that these features are indeed useful, also in urban and vegetated areas, given a representative training set. In addition we tested whether rainfall data and distance to buildings help to improve the models. The rainfall sum for hurricane Harvey had very little spatial variance over the Houston area, therefore it is rather unsurprising that it does not lead to an improvement here. We hesitate to draw a general conclusion from this result, as the effect may be different for an event with more heterogeneous rainfall distribution. The results of our investigation did neither show clear improvements from using distance to buildings as a feature, nor from drawing more training samples from urban areas, when learning from the DLR_BN initial mask. However, the skill of the models learned from EMSR_229 did improve slightly, and the transfer skill of models learned from DLR_CNN to AOI1 did improve strongly, when including the distance to buildings. The sampling further seems to have at least a small effect on the skill on urban areas for the EMSR_229 models. A possible explanation is that DLR_BN already covers significant parts of urban and non-urban areas alike, while DLR_CNN covers primarily urban flood, and EMSR_229 almost exclusively non-urban flood. Therefore, the distance to buildings might be more useful in these models to describe how feature distributions of flooded areas differ closer to and further away from the city, respectively. Since the results do not indicate any negative effect of including the distance to buildings, we suggest to include it when available. To further improve the feature engineering, automating this step via deep learning might be an idea worth investigating in future studies. Especially local context features, as generated by a CNN, have been successful in improving various land cover classifications, including detection of water [65,66].
Both tested OCC algorithms, BSVM and MaxEnt, performed similarly well in the overall statistics. BSVM exhibits a closer fit to the training data, and is less affected by feature selection and sampling. Ng & Jordan [67] state that discriminative algorithms often perform better than generative algorithms for complex classification problems. This might partially explain why BSVM in most cases performs better than MaxEnt in detecting urban flooding. However, explaining this finding remains speculative to a certain extent. The best models on AOI1 and AOI2 also came close to the PN benchmark, but there is still a significant margin which indicates potential for improvement. While we assume that our positive training labels are mostly correct, there will for sure be some violation of this assumption. BSVM can theoretically handle this problem to a certain extent, because outliers in the positive training samples will be classified as negative if they are so far in the "negative realm" that the biased penalty term is overruled. MaxEnt assumes positive samples to be clean from errors [53], so a preprocessing of the initial masks might be an option to consider. Instead of performing classification in one step, it is also possible to iteratively single out the reliable negatives [41]. As the amount of available training samples in our task is relatively high, we did not implement such an iterative refinement, but rather relied of the effectiveness of data. The effect of training label distribution is debated and difficult to estimate without doing systematic tests for each dataset, as the naturally occurring class distribution-even in cases where there is such a distribution -is often not the most appropriate [68]. Besides this, also other PU algorithms are available in the literature, for example, [69,70]. If a validity layer for the initial flood mask is available, an alternative approach would be to train any regular PN classifier on the valid areas. Hydrodynamic simulations are able to model flooding in vegetated and urban areas as well, for example, Wing et al. [44] for the event in question. A drawback of our presented approach in comparison to a physical model is that the machine learning models do not account for hydrodynamic effects, or in general a closed water balance (no more water predicted than available). However, we argue that a hydrodynamic simulation could make use of the improved flood masks from our approach via data assimilation.

Threshold Selection
As this paper presents a novel approach, rather than a particular classification, we provide the threshold-independent score AUC PN , further performance metrics at θ Opt , and the loss ∆κ when resorting to θ De f ault . We are fully aware that optimal threshold selection in the absence of PN reference data is tricky. By which metric the user optimizes the threshold selection will depend on the application case, that is, how much sensitivity or specificity is required. Maximum κ may not be the desired quantity. Mack et al. [41] further suggested a manual approach (i.e., not automated) to derive a maximum a-posteriori threshold from a Gaussian mixture model analysis of the posterior density of the continuous prediction. However, that procedure is based on the assumptions that the posterior can be described by a combination of Gaussians, and that the component with the highest mean value is equal to the positive class, while all other components belong to the negative class. Another assumption in their approach is that the classes do not overlap at a specified point used to estimate the prior probabilities. These assumptions are certainly violated for some of our models, and this approach is not feasible in the context of this paper, as we compare many models to get an idea of the upper bound of performance of our procedure. MaxEnt also provides a different form of output, called the cumulative format, which allows setting a threshold based on the accepted omission rate [71]. Depending on the application, this may be a more desirable way of threshold selection. In cases where the training data is representative, the most straightforward approach is to use the default threshold or to optimize a PU performance metric of choice on the training data. For the benchmark models, training and application data were drawn from the same underlying distribution, and in that case the skill at θ De f ault and θ Opt is almost identical. This proves that the procedure is in principle able to obtain very good results, given a representative training set. In the application using satellite-based flood masks, a bias in the feature distributions of the positive training samples is to be expected, as we know that the areas detected from satellite imagery are not entirely representative of the true flood extent. Elith et al. [53] claim that PU models are even stronger affected by sample bias than PN models, because sample bias affects both positive and negative records in the PN case, but only the positive samples in the PU case. In our case, this "sample bias" corresponds to the representativeness of the initial flood mask. This leads us to assume that including additional positive class examples from within the urban area could make the positive training data more representative, and thereby improve the performance of the PU models at θ De f ault . Such data could potentially be taken from sources such as social media content or street camera footage, which is only punctually available but provides data from within the city center.

Conclusions
We presented an extrapolation technique for satellite-based flood masks to unobservable areas, by using OCC algorithms. Especially vegetated and urban areas still pose a challenge to currently available remote sensing products, the latter of which are of major importance for impact estimation. The quality of the initial EMSR_229 mask was found to be poor, detecting almost exclusively open water. Although it does exhibit very high specificity, a map with extreme specificity but very low sensitivity is trivial (only few easy-to-find spots detected) and of limited practical value. As long as the spatial validity of satellite-based flood masks is not clearly communicated, for example, by a separate validity layer, we suggest treating them as positive and unlabeled in this context. OCC is then the adequate tool, avoiding to explicitly train unobservable areas as "non-flooded". Using supervised machine learning for extrapolation is straightforward once using an OCC, as the necessary positive labels for training are readily available from the initial mask. Our procedure allows for predicting a continuous score of how likely flood is to be expected per pixel, given the original mapping and the used features. A threshold can then be applied to derive a binary classification, and a subsequent region-growing raises the specificity of the extrapolation. From the user's perspective, the presented method is relatively simple to use, as the entire initial mask can be processed without the need to exclude any areas from sampling. The most important features can already be generated from a DEM and stream locations (which can also be derived from a DEM if necessary). Distance to building footprints and gridded rainfall data did not consistently improve the results, although positive effects were observed for some models.
We conclude that all three of the tested satellite-based products have been improved to a certain extent. The absolute quality of the extrapolation, as well as the suitability of the default threshold in application, hinges on the representativeness of the initial mask. The features used in this study are not sufficient for a full separation of flooded and dry locations, but a model trained on representative training data still achieves high performance (AUC PN 0.91-0.98 in the benchmark case, 0.94 for the best PU model). The method in its current form may be useful for statistical applications on a scale where satellite imagery is utilized. It is not yet fit for analysis of individual streets, although the results with high quality input seem promising. Potential application of the presented method is not limited to masks from SAR data-it could also be used to fill holes from clouds in masks from optical data, or tested for social media derived extents. In particular, we see potential for future studies in the fusion of satellite-based flood masks with spottily mapped flood locations within a city center, for example, by social media or street camera footage. Such a fused dataset is expected to provide more representative coverage in feature space, which should lead to a more reliable default threshold. The presented approach could be tested in this direction with the aim of deriving more reliable flood extents in vegetated and urban areas. Funding: The research presented in this paper was conducted in the framework of the project "Multirisk analysis and information system components for the Andes region (RIESGOS)" funded by the German Ministry of Education and Research (BMBF); (contract no: 03G0876B).
Institutional Review Board Statement: Not applicable.

Acknowledgments:
The authors wish to thank Yu Li for providing the high quality flood masks, all abovementioned providers of publicly available datasets, as well as Oliver Wing of FATHOM and the company JBA for access to hydraulic simulations that were used for comparison in the early stage of this study.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A Table A1. All model setups and threshold-independent ranking skill. Setup IDs 1-8 have been excluded for the plots in the main publication, to ensure the same number of points for all initial flood masks.

Setup ID Algorithm Flood Mask Training Extent Sampling
Features AUC PU AUC PN -AOI1 AUC PN -AOI2 AUC PN -AOI3