A 30 m Resolution Surface Water Mask Including Estimation of Positional and Thematic Differences Using Landsat 8, SRTM and OpenStreetMap A Case Study in the Murray-Darling Basin, Australia

: Accurate maps of surface water are essential for many environmental applications. Surface water maps can be generated by combining measurements from multiple sources. Precise estimation of surface water using satellite imagery remains a challenging task due to the sensor limitations, complex land cover, topography, and atmospheric conditions. As a complementary dataset, in the case of hilly landscapes, a drainage network can be extracted from high-resolution digital elevation models. Additionally, Volunteered Geographic Information (VGI) initiatives, such as OpenStreetMap, can also be used to produce high-resolution surface water masks. In this study, we derive a high-resolution water mask using Landsat 8 imagery and OpenStreetMap as well as (potential) a drainage network using 30 m SRTM. Our approach to derive a surface water mask from Landsat 8 imagery comprises the use of a lower 15% percentile of Landsat 8 Top of Atmosphere (TOA) reﬂectance from 2013 to 2015. We introduce a new non-parametric unsupervised method based on the Canny edge ﬁlter and Otsu thresholding to detect water in ﬂat areas. For hilly areas, the method is extended with an additional supervised classiﬁcation step used to reﬁne the water mask. We applied the method across the Murray-Darling basin, Australia. Differences between our new Landsat-based water mask and the OpenStreetMap water mask regarding positional differences along the rivers and overall coverage were analyzed. Our results show that about 50% of the OpenStreetMap linear water features can be conﬁrmed using the water mask extracted from Landsat 8 imagery and the drainage network derived from SRTM. We also show that the observed distances between river features derived from OpenStreetMap and Landsat 8 are mostly smaller than 60 m. The differences between the new water mask and SRTM-based linear features and hilly areas are slightly larger (110 m). The overall agreement between OpenStreetMap and Landsat 8 water masks is about 30%.


Introduction
High-resolution measurements of geometry and dynamics of surface water are essential parameters required for many environmental applications, such as flood forecasting and warning, The volunteered nature of OSM is the main factor that makes it less adopted by GIS professionals [27] stressing the importance of the development of automated methods and tools to validate its quality in comparison to other datasets. A good example of how OSM quality can be assessed can be found in [28], showing how positional differences between linear and polygonal features can be addressed. Furthermore, the "increasing buffer" method [29] can be used to estimate the quality of linear features [24]. An excellent overview of papers focusing on OSM quality can be found in [30]. The latter also suggests using elements of ISO 19113:2002 [31] (such as completeness, an error of commission/omission, or positional differences) to evaluate the OSM quality. Unfortunately, to our knowledge, no academic literature exists focusing solely on the quality of water features present in OSM and using global or nearly global remote sensing datasets.
The main reason why OSM was chosen in the current study over local, authoritative datasets, is that it provides a global coverage, even though its local coverage and quality may vary. Additional research would be required to perform a detailed comparison of the datasets presented in this paper to the local Australian authoritative datasets, such as Surface Hydrology [32], Water Observations from Space [33], or 5 m Digital Elevation Model (DEM) of Australia [34].
We derive a high-resolution water mask using Landsat 8 imagery and OpenStreetMap, as well as a (potential) drainage network using 30 m SRTM. Extracting a water mask from OpenStreetMap data is relatively straightforward (Section 2.2), but the other data sources require a specialized workflow. Our approach to derive a surface water mask from Landsat 8 imagery is described in Section 2.4, involving the steps to compute cloud-free average reflectance composites in Section 2.4.1. Additionally, we introduce a new non-parametric unsupervised method to detect water in flat areas (Section 2.4.2). We also propose a supervised classification step to refine the water mask in hill areas (Section 2.4.3).
We make use of several open geospatial and remote sensing datasets to construct an open water map. Section 2 provides an overview of the main input datasets utilized in the study, as well as methods applied or developed to detect water, and to compare the resulting water masks. The main input datasets include (1) images acquired by the NASA Landsat 8 mission [35]; (2) a new revision of a nearly-global 30 m DEM, measured by the SRTM mission [36]; and (3) OSM data for the Murray-Darling basin in Australia.

Study Site: Murray-Darling River Basin
The Murray-Darling Basin [37], named after the two main rivers in the basin, is located in semi-arid and arid climate zones, covering 1,059,000 square kilometers, an equivalent of 14% of Australia's total surface area. OSM water features include 10,106 linear and 6708 aerial hydrographic features representing both natural and manmade water features, such as rivers, lakes, and canals ( Figure 1). The features also include water bodies, which are only partially covered by water during the year. Figure 1 shows both hydrographic features extracted from OSM and the (potential) river network obtained from HydroSHEDS [18].
The Murray-Darling Basin receives most of its rainfall from a very small percentage of the basin area; mainly along the southern and eastern parts. The rest of the basin is flat and low-lying, contributing very little or no run-off to the rivers. Relevant for this study, many tributaries within the basin are intermittent streams, with highly variable flows dependent on the wetness of the year.

Input Datasets Used to Extract the Water Mask
Even though HydroSHEDS provides a much better coverage, its resolution is limited to 15 arcseconds (~450 m), which is insufficient to resolve all of the small water features required for detailed water-related applications. Furthermore, it was based on an older, 90 m revision of SRTM, which is less accurate when comparing to a recently-released 30 m version. Therefore, to develop a method to generate the water mask, we have used four raw input datasets (see Table 1 and Figure 2). The first dataset, Landsat 8, combines 2743 scenes of optical multi-spectral satellite imagery acquired during 2013-2015 over the study area. The Landsat 8 mission has already operated for more than two years and, therefore, provides a reasonably long period of information to estimate water dynamics on a global scale. The second dataset is based on the new 30 m revision of the SRTM digital elevation model. The last data source consists of all water features, extracted from a recent version of OSM Planet File [39]. Additionally, HydroBASINS [40] was used as a supplementary dataset to simplify the extraction of a drainage network from SRTM.

Input Datasets Used to Extract the Water Mask
Even though HydroSHEDS provides a much better coverage, its resolution is limited to 15 arcseconds (~450 m), which is insufficient to resolve all of the small water features required for detailed water-related applications. Furthermore, it was based on an older, 90 m revision of SRTM, which is less accurate when comparing to a recently-released 30 m version. Therefore, to develop a method to generate the water mask, we have used four raw input datasets (see Table 1 and Figure 2). The first dataset, Landsat 8, combines 2743 scenes of optical multi-spectral satellite imagery acquired during 2013-2015 over the study area. The Landsat 8 mission has already operated for more than two years and, therefore, provides a reasonably long period of information to estimate water dynamics on a global scale. The second dataset is based on the new 30 m revision of the SRTM digital elevation model. The last data source consists of all water features, extracted from a recent version of OSM Planet File [39]. Additionally, HydroBASINS [40] was used as a supplementary dataset to simplify the extraction of a drainage network from SRTM.
Extraction of water features from OSM is a relatively trivial task, mainly consisting of a careful selection of proper filters and data conversion tools. We have used data conversion tools provided by OSM and a set of programs based on GDAL [41] to select only relevant features representing surface water.
To derive a high-resolution drainage network from SRTM, we have used the D8 method [42], implemented using PCRaster [43,44]. The stream delineation step was applied for every catchment of the Murray-Darling basin. The HydroBASINS level 8 catchment geometries were used for this purpose. This step was required due to the large size of the catchment making it difficult to perform processing in a single step.  Extraction of water features from OSM is a relatively trivial task, mainly consisting of a careful selection of proper filters and data conversion tools. We have used data conversion tools provided by OSM and a set of programs based on GDAL [41] to select only relevant features representing surface water.
To derive a high-resolution drainage network from SRTM, we have used the D8 method [42], implemented using PCRaster [43,44]. The stream delineation step was applied for every catchment of the Murray-Darling basin. The HydroBASINS level 8 catchment geometries were used for this purpose. This step was required due to the large size of the catchment making it difficult to perform processing in a single step.
Extraction of the water mask using Landsat 8 imagery was the most challenging task and required the development of a new method to derive a water mask from multi-spectral imagery. The new method combines the use of the MNDWI spectral index (Equation (2)) extended with a non-parametric detection of a local threshold to improve the accuracy of water detection. The MNDWI is very similar to the NDWI spectral index (Equation (1)) but uses a shortwave infrared band instead of near-infrared. Additional steps include the use of the Normalized Difference Vegetation Index (NDVI) [45] (Equation (3)) with a high threshold value (0.3) to exclude false water detection in Extraction of the water mask using Landsat 8 imagery was the most challenging task and required the development of a new method to derive a water mask from multi-spectral imagery. The new method combines the use of the MNDWI spectral index (Equation (2)) extended with a non-parametric detection of a local threshold to improve the accuracy of water detection. The MNDWI is very similar to the NDWI spectral index (Equation (1)) but uses a shortwave infrared band instead of near-infrared. Additional steps include the use of the Normalized Difference Vegetation Index (NDVI) [45] (Equation (3)) with a high threshold value (0.3) to exclude false water detection in very dark vegetated areas.
We have included an additional classification step to refine the water mask for hilly areas, where the results of automated classification led to high commission errors. We have used a supervised classification based on Classification And Regression Tree (CART) [46], which was trained using a manually digitized training set to distinguish between water and land pixels.
Our selection of the input datasets was based on the assumption that the accuracy of all three datasets is similar. However, estimation of the actual errors of OSM would be difficult, mainly because OSM features are usually based on different measurement methods (GPS traces, or manual digitizing using medium-or high-resolution imagery of a bulk import from other databases of varying nature). For Landsat 8 and SRTM, the main limitations are well known [35,47]. Horizontal accuracy of Landsat 8 is known to be better than 12 m in 90% of the 30 m resolution images of the Operational Land Imager (OLI). For SRTM, vertical relative and absolute errors can be explained by its radar nature and are in the order of 10 and 16 meters for 90% of the data, with an absolute geolocation error below 13 m.

Derivation of Hydrological Variables: Drainage Network and HAND
In addition to the drainage network, a number of other hydrological parameters were derived during SRTM analysis, such as local drainage direction, flow accumulation, and HAND (see Figure 3). Our selection of the input datasets was based on the assumption that the accuracy of all three datasets is similar. However, estimation of the actual errors of OSM would be difficult, mainly because OSM features are usually based on different measurement methods (GPS traces, or manual digitizing using medium-or high-resolution imagery of a bulk import from other databases of varying nature). For Landsat 8 and SRTM, the main limitations are well known [35,47]. Horizontal accuracy of Landsat 8 is known to be better than 12 m in 90% of the 30 m resolution images of the Operational Land Imager (OLI). For SRTM, vertical relative and absolute errors can be explained by its radar nature and are in the order of 10 and 16 meters for 90% of the data, with an absolute geolocation error below 13 m.

Derivation of Hydrological Variables: Drainage Network and HAND
In addition to the drainage network, a number of other hydrological parameters were derived during SRTM analysis, such as local drainage direction, flow accumulation, and HAND (see Figure 3).
A (potential) water mask was generated by thresholding HAND values using 1 m elevation above the nearest drain. We found this approach to work better in estimating water masks in flat areas compared to the methods based on flow accumulation.
All hydrological variables were derived using the following steps: (1) clip DEM using a Google Earth Engine script and one of 1494 HydroBASIN polygons and download it to Google Compute Engine (GCE); (2) compute slopes, local drainage directions (LDD, includes pit-removal step), flow accumulations, and HAND; and (3) upload results to Google Earth Engine [48] and Google Fusion Table [49] so they can be used in parallel-processing scripts for further analysis. To compute HAND, a drainage network had to be estimated by thresholding the flow accumulation. We have used threshold equal to 100 upstream cells, which was sufficient to detect most of the potential rivers.  The resulting HAND can be used for the estimation of potential flood areas, but also to detect pixels where potential errors occur due to hill shadows. These areas were estimated by generating a binary mask based on a variation of HAND values. The mask showing potential hilly areas was computed by marking pixels as hilly in a case where HAND values larger or equal to 30 m were found in the 300 m radius neighborhood.

Method of Water Detection Using Landsat 8
Our method of water detection from multi-spectral multi-temporal imagery is based on a step-wise approach combining unsupervised and supervised classification steps ( Figure 4). The unsupervised step was applied first to detect the initial water mask using percentile images of reflectance, resulting in minor omission errors in flat areas. However, we obtain very high commission errors in hilly areas A (potential) water mask was generated by thresholding HAND values using 1 m elevation above the nearest drain. We found this approach to work better in estimating water masks in flat areas compared to the methods based on flow accumulation.
All hydrological variables were derived using the following steps: (1) clip DEM using a Google Earth Engine script and one of 1494 HydroBASIN polygons and download it to Google Compute Engine (GCE); (2) compute slopes, local drainage directions (LDD, includes pit-removal step), flow accumulations, and HAND; and (3) upload results to Google Earth Engine [48] and Google Fusion Table [49] so they can be used in parallel-processing scripts for further analysis. To compute HAND, a drainage network had to be estimated by thresholding the flow accumulation. We have used threshold equal to 100 upstream cells, which was sufficient to detect most of the potential rivers.
The resulting HAND can be used for the estimation of potential flood areas, but also to detect pixels where potential errors occur due to hill shadows. These areas were estimated by generating a binary mask based on a variation of HAND values. The mask showing potential hilly areas was computed by marking pixels as hilly in a case where HAND values larger or equal to 30 m were found in the 300 m radius neighborhood.

Method of Water Detection Using Landsat 8
Our method of water detection from multi-spectral multi-temporal imagery is based on a step-wise approach combining unsupervised and supervised classification steps ( Figure 4). The unsupervised step was applied first to detect the initial water mask using percentile images of reflectance, resulting in minor omission errors in flat areas. However, we obtain very high commission errors in hilly areas due to terrain shadow. Therefore, an additional step was required to refine the water mask using supervised classification. due to terrain shadow. Therefore, an additional step was required to refine the water mask using supervised classification. The unsupervised classification step is based on the local adaptive threshold detection method presented in Section 2.4.2. The supervised classification step, based on CART, was introduced to reduce commission errors found in the case of shadows and snow/ice pixels. It was performed only for hilly areas where misclassified pixels were observed. Hilly areas were detected using a threshold of Hmax, representing the minimum HAND value (30 m in our case). In this way, we could keep omission errors low for flat areas and ensure low commission errors for hilly areas. The final error of water detection was very low (less than 1%), mainly due to a presence of mixed pixels or incomplete training data for hilly areas. An additional step was required to exclude very dark vegetated areas, resulting in high MNDWI values and, therefore, misclassified as water. These errors were removed by eliminating pixels with NDVI values greater than 0.3. To compute the final water mask, the study area was divided into regular grid tiles of 0.2 × 0.2 degrees in size ( Figure A1). This step was needed to make sure that dynamic MNDWI threshold is estimated for every tile, but also to parallelize the processing. Finally, the above workflow was applied to every tile.

Cloud-Free Landsat 8 Percentile Images
Our method to exclude clouds and shadows in the satellite imagery is based on the use of percentile images extracted from an image collection, spanning a two year period (2013-2015), instead of the original images. The percentile images were computed on a per-pixel and per-band basis using 50-130 top of atmosphere (TOA) intensity values. The larger number of images comes from a higher revisit frequency due to an overlap in the satellite swath.
The percentile images ( Figure 5) appeared to describe the water dynamics in a better way when compared to the interval mean images used in the other studies [12,13]. Even though we have confirmed this result only by visual inspection, the reasoning comes from the fact that the water surface area may change sharply depending on local topographic conditions. This water area change results in sharp changes between water masks present in different percentiles (see Figure 6).  The unsupervised classification step is based on the local adaptive threshold detection method presented in Section 2.4.2. The supervised classification step, based on CART, was introduced to reduce commission errors found in the case of shadows and snow/ice pixels. It was performed only for hilly areas where misclassified pixels were observed. Hilly areas were detected using a threshold of Hmax, representing the minimum HAND value (30 m in our case). In this way, we could keep omission errors low for flat areas and ensure low commission errors for hilly areas. The final error of water detection was very low (less than 1%), mainly due to a presence of mixed pixels or incomplete training data for hilly areas. An additional step was required to exclude very dark vegetated areas, resulting in high MNDWI values and, therefore, misclassified as water. These errors were removed by eliminating pixels with NDVI values greater than 0.3.
To compute the final water mask, the study area was divided into regular grid tiles of 0.2ˆ0.2 degrees in size ( Figure A1). This step was needed to make sure that dynamic MNDWI threshold is estimated for every tile, but also to parallelize the processing. Finally, the above workflow was applied to every tile.

Cloud-Free Landsat 8 Percentile Images
Our method to exclude clouds and shadows in the satellite imagery is based on the use of percentile images extracted from an image collection, spanning a two year period (2013-2015), instead of the original images. The percentile images were computed on a per-pixel and per-band basis using 50-130 top of atmosphere (TOA) intensity values. The larger number of images comes from a higher revisit frequency due to an overlap in the satellite swath.
The percentile images ( Figure 5) appeared to describe the water dynamics in a better way when compared to the interval mean images used in the other studies [12,13]. Even though we have confirmed this result only by visual inspection, the reasoning comes from the fact that the water surface area may change sharply depending on local topographic conditions. This water area change results in sharp changes between water masks present in different percentiles (see Figure 6).  Over the Murray-Darling Basin, the percentile range of 15%-55% of all TOA intensities was empirically found to be suitable for permanent water detection. However, for semi-arid and flat areas, where cloud frequency is very low, a larger range of percentiles (up to 90%) could be used as well. Average images generated for very low percentiles usually result in too many artifacts present in the images due to cloud and hill shadows, making them difficult to interpret automatically (Figure 6a,b). At the same time, the use of lower percentiles has a higher chance to represent a larger amount of surface water, present during floods and wet seasons, and a smaller amount of surface water  Over the Murray-Darling Basin, the percentile range of 15%-55% of all TOA intensities was empirically found to be suitable for permanent water detection. However, for semi-arid and flat areas, where cloud frequency is very low, a larger range of percentiles (up to 90%) could be used as well. Average images generated for very low percentiles usually result in too many artifacts present in the images due to cloud and hill shadows, making them difficult to interpret automatically (Figure 6a,b). At the same time, the use of lower percentiles has a higher chance to represent a larger amount of surface water, present during floods and wet seasons, and a smaller amount of surface water Over the Murray-Darling Basin, the percentile range of 15%-55% of all TOA intensities was empirically found to be suitable for permanent water detection. However, for semi-arid and flat areas, where cloud frequency is very low, a larger range of percentiles (up to 90%) could be used as well. Average images generated for very low percentiles usually result in too many artifacts present in the images due to cloud and hill shadows, making them difficult to interpret automatically (Figure 6a,b).
At the same time, the use of lower percentiles has a higher chance to represent a larger amount of surface water, present during floods and wet seasons, and a smaller amount of surface water observed during dry seasons (Figure 6d-f). For a more detailed analysis, the choice of the upper and lower percentiles can be estimated by taking cloud frequency and topographic conditions into account.

Adaptive Threshold Detection Using MNDWI, Canny Edge Filter, and Otsu Thresholding
As was previously mentioned, varying thresholds of spectral water indices usually limit their applicability, requiring manual threshold adjustment because the spectral properties of open water are not the same everywhere. In many cases, surface water constitutes only a small fraction of the overall land cover (Figure 7), making it harder to detect with threshold-based methods. Large local errors may be introduced when a constant threshold of 0.0 is used to distinguish between water and land ( Figure 8). The challenge is to establish a varying threshold, that can be derived automatically. As was previously mentioned, varying thresholds of spectral water indices usually limit their applicability, requiring manual threshold adjustment because the spectral properties of open water are not the same everywhere. In many cases, surface water constitutes only a small fraction of the overall land cover (Figure 7), making it harder to detect with threshold-based methods. Large local errors may be introduced when a constant threshold of 0.0 is used to distinguish between water and land ( Figure 8). The challenge is to establish a varying threshold, that can be derived automatically. The use of non-parametric threshold detection methods, such as histogram-based Otsu thresholding helps to overcome some of the problems [15,16], but this approach does not work when the fraction of water pixels is small.
The main principle of our method is an extension of the Otsu-based methods by a Canny edge filter to reduce the number of input pixels only to those located near water-land edges. The use of the Canny edge filter with a very high threshold applied to the spectral index image reveals only edges located near sharp value changes. In the case of water spectral indices, this usually takes place when the near-infrared band abruptly changes in value from one pixel to the next. As was previously mentioned, varying thresholds of spectral water indices usually limit their applicability, requiring manual threshold adjustment because the spectral properties of open water are not the same everywhere. In many cases, surface water constitutes only a small fraction of the overall land cover (Figure 7), making it harder to detect with threshold-based methods. Large local errors may be introduced when a constant threshold of 0.0 is used to distinguish between water and land ( Figure 8). The challenge is to establish a varying threshold, that can be derived automatically. The use of non-parametric threshold detection methods, such as histogram-based Otsu thresholding helps to overcome some of the problems [15,16], but this approach does not work when the fraction of water pixels is small.
The main principle of our method is an extension of the Otsu-based methods by a Canny edge filter to reduce the number of input pixels only to those located near water-land edges. The use of the Canny edge filter with a very high threshold applied to the spectral index image reveals only edges located near sharp value changes. In the case of water spectral indices, this usually takes place when the near-infrared band abruptly changes in value from one pixel to the next. The use of non-parametric threshold detection methods, such as histogram-based Otsu thresholding helps to overcome some of the problems [15,16], but this approach does not work when the fraction of water pixels is small.
The main principle of our method is an extension of the Otsu-based methods by a Canny edge filter to reduce the number of input pixels only to those located near water-land edges. The use of the Canny edge filter with a very high threshold applied to the spectral index image reveals only edges located near sharp value changes. In the case of water spectral indices, this usually takes place when the near-infrared band abruptly changes in value from one pixel to the next.
Potential water and land pixels located near water are then computed using morphological dilation applied to the detected edges. It is important to note that this approach can result in a skewed distribution in the case of thin, single pixel wide water bodies (canals). To overcome this problem we have used a buffer size (dilation) equal to half of the pixel in Step 3. In an ideal situation, the resulting distribution should look bimodal (Figure 9) so that a clear distinction between land and water can be made from this distribution. Potential water and land pixels located near water are then computed using morphological dilation applied to the detected edges. It is important to note that this approach can result in a skewed distribution in the case of thin, single pixel wide water bodies (canals). To overcome this problem we have used a buffer size (dilation) equal to half of the pixel in Step 3. In an ideal situation, the resulting distribution should look bimodal (Figure 9) so that a clear distinction between land and water can be made from this distribution. The proposed method was applied to the cloud-free percentile images. In a case of two classes in the grid tile, we were able to get an almost perfect detection of water pixels using the following parameters: = 0.7, ℎ = 0.99 for the Canny edge filter, and a structuring element with the size 15 m × 15 m to dilate the edges in Step (c) and create a surrounding buffer region. The and ℎ parameters are used to define the standard deviation of the Gaussian smoothing kernel and the threshold used to define sensitivity of the filter, respectively.

Refining Water Detection Using Supervised Classification Based on CART and HAND
The method was applied over 1725 spatial boxes of 20 km × 20 km covering the Murray-Darling basin. The 20 km × 20 km area was chosen arbitrarily, and it is assumed that the threshold values are the same within each spatial box area. The resulting MNDWI threshold values have a range of −0.25 to 0.4 (Figure 10), which clearly highlights the need for varying the MNDWI threshold, spatially.  The proposed method was applied to the cloud-free percentile images. In a case of two classes in the grid tile, we were able to get an almost perfect detection of water pixels using the following parameters: σ " 0.7, th " 0.99 for the Canny edge filter, and a structuring element with the size 15 mˆ15 m to dilate the edges in Step (c) and create a surrounding buffer region. The σ and th parameters are used to define the standard deviation of the Gaussian smoothing kernel and the threshold used to define sensitivity of the filter, respectively.

Refining Water Detection Using Supervised Classification Based on CART and HAND
The method was applied over 1725 spatial boxes of 20 kmˆ20 km covering the Murray-Darling basin. The 20 kmˆ20 km area was chosen arbitrarily, and it is assumed that the threshold values are the same within each spatial box area. The resulting MNDWI threshold values have a range of´0.25 to 0.4 (Figure 10), which clearly highlights the need for varying the MNDWI threshold, spatially.
The use of simple water spectral indices in hilly areas is usually not sufficient, as they frequently result in very high commission errors due to false detection of water pixels. This misclassification is especially true for MNDWI, which is more sensitive to hill shadows than NDWI. The reason is the spectral signature of hill shadows, which looks similar to the one corresponding to water, resulting in large MNDWI values. To remove these errors, we have trained a CART classifier ( Figure 11) using a manually-created training set, all Landsat 8 bands, and HAND. The training set was created only to include those pixels that were misclassified during the unsupervised step, which appeared to be true only in hilly areas. The HAND values used to train the classifier were included only when they were greater than 10 m. This constraint was introduced to reduce the influence of SRTM errors that get higher near water bodies.

Refining Water Detection Using Supervised Classification Based on CART and HAND
The method was applied over 1725 spatial boxes of 20 km × 20 km covering the Murray-Darling basin. The 20 km × 20 km area was chosen arbitrarily, and it is assumed that the threshold values are the same within each spatial box area. The resulting MNDWI threshold values have a range of −0.25 to 0.4 (Figure 10), which clearly highlights the need for varying the MNDWI threshold, spatially.  The use of simple water spectral indices in hilly areas is usually not sufficient, as they frequently result in very high commission errors due to false detection of water pixels. This misclassification is especially true for MNDWI, which is more sensitive to hill shadows than NDWI. The reason is the spectral signature of hill shadows, which looks similar to the one corresponding to water, resulting in large MNDWI values. To remove these errors, we have trained a CART classifier ( Figure 11) using a manually-created training set, all Landsat 8 bands, and HAND. The training set was created only to include those pixels that were misclassified during the unsupervised step, which appeared to be true only in hilly areas. The HAND values used to train the classifier were included only when they were greater than 10 m. This constraint was introduced to reduce the influence of SRTM errors that get higher near water bodies. The classification was performed using a Google Earth Engine implementation of CART with a tree depth increased from 10 (default value) to 20. The larger tree depth was required to avoid overfitting and because our study basin covers a relatively large area, resulting in a large variation of water and land use types. Overfitting was detected by observing the confusion matrix generated after training the classifier. In the final training set, the confusion error was very close to zero.
The final training set contains about 500 polygons created manually and iteratively by training the classifier for one set of tiles and then validating it for all other tiles where supervised classification was required. This step was required mainly for the grid tiles located in the southern part of the catchment (hilly landscapes).
Since Landsat can reveal water features better than SRTM, we have also decided to analyze the HAND values of all pixels used during the training stage ( Figure 12). The results reveal that HAND values can take values up to 40 m even though visual inspection shows that these pixels belong to water. A low SRTM accuracy can explain this mismatch near water. Another confusing result comes from a comparison of NDWI and MNDWI. Theoretically, they should be strongly correlated. However, substantial differences were observed in multiple locations. These differences occur The classification was performed using a Google Earth Engine implementation of CART with a tree depth increased from 10 (default value) to 20. The larger tree depth was required to avoid overfitting and because our study basin covers a relatively large area, resulting in a large variation of water and land use types. Overfitting was detected by observing the confusion matrix generated after training the classifier. In the final training set, the confusion error was very close to zero.
The final training set contains about 500 polygons created manually and iteratively by training the classifier for one set of tiles and then validating it for all other tiles where supervised classification was required. This step was required mainly for the grid tiles located in the southern part of the catchment (hilly landscapes).
Since Landsat can reveal water features better than SRTM, we have also decided to analyze the HAND values of all pixels used during the training stage ( Figure 12). The results reveal that HAND values can take values up to 40 m even though visual inspection shows that these pixels belong to water. A low SRTM accuracy can explain this mismatch near water. Another confusing result comes from a comparison of NDWI and MNDWI. Theoretically, they should be strongly correlated. However, substantial differences were observed in multiple locations. These differences occur because many pixels requiring manual classification are mixed pixels, partially covered by land. Further research would be necessary to explain these results.

River Centerline Estimation from Landsat 8 Water Mask
An additional skeletonization step was required before computing positional differences between OSM linear water features and water masks detected from Landsat imagery (or HAND). We have used the method of mathematical morphology [50] by applying an iterative thinning operator applying a hit-or-miss transform to a binary water mask image. The actual steps include: where S denotes a skeleton image constructed from W-original binary water mask, K-number of iterations for morphological skeletonization operator defined by , and -are the two composite structuring elements, and ⊗ is a binary thinning operator defined as: where ⊙ is a hit-or-miss transform operator defined as: where is the set complement of , and ⊖ is erosion operator, represents background and -foreground parts of the composite structured element B = ( , ).
The structuring elements B 1 and B 2 ( Figure 13) allow reconstruction of river skeletons even without the need to introduce pruning (the removal of small branches), which is usually required during such processing. The actual implementation performs hit-or-miss transform using four rotated versions of the structuring elements applied during every thinning iteration.
Additionally, morphological smoothing was performed before the skeletonization step, to make sure fewer branches were present in the final centerline.
To implement most algorithms, we have used the Google Earth Engine (GEE) parallel computing platform [48]. Some of the computations, where GEE was less suitable, were performed using Google Compute Engine (GCE) running Ubuntu 15.04 (Vivid Vervet). The use of a dedicated machine was more suitable for the computation of hydrological parameters from the DEM, for which no algorithms are available yet in the GEE environment. JavaScript and Python were used as programming languages, as well as some open source tools and libraries (GDAL [41], Fiona [51],

River Centerline Estimation from Landsat 8 Water Mask
An additional skeletonization step was required before computing positional differences between OSM linear water features and water masks detected from Landsat imagery (or HAND). We have used the method of mathematical morphology [50] by applying an iterative thinning operator applying a hit-or-miss transform to a binary water mask image. The actual steps include: S pWq " S K´SK´1´. . .´S 0 pWq¯¯¯(4) where S denotes a skeleton image constructed from W-original binary water mask, K-number of iterations for morphological skeletonization operator defined by S k , B 1 and B 2 -are the two composite structuring elements, and b is a binary thinning operator defined as: where d is a hit-or-miss transform operator defined as: where W c is the set complement of W, and a is erosion operator, B 1 represents background and B 2 -foreground parts of the composite structured element B = pB 1 , B 2 q. The structuring elements B 1 and B 2 ( Figure 13) allow reconstruction of river skeletons even without the need to introduce pruning (the removal of small branches), which is usually required during such processing. The actual implementation performs hit-or-miss transform using four rotated versions of the structuring elements applied during every thinning iteration.

Estimation of Positional Differences between Rivers
After the centerline is computed, positional differences can be easily computed using Goodchild's method of increasing overlay polygons.
The above method works best when the lengths of the segments and the distance between segments are significantly larger than the maximum buffer size used during dilation.

Positional Differences between OpenStreetMap, Landsat, and SRTM
Goodchild's method to estimate positional differences was applied for every line segment of the OSM dataset and two water mask raster datasets: (1) the drainage network estimated using HAND and (2) water centerline estimated using the Landsat 8 water mask. The final differences were estimated using Score = 0.85 as a threshold, which corresponds to a distance where 85% of the pixels from the second dataset (black line in Figure 14) are covered by a dilation via this distance. The frequency and cumulative histograms of the resulting differences are presented in Figure 15.  Additionally, morphological smoothing was performed before the skeletonization step, to make sure fewer branches were present in the final centerline.
To implement most algorithms, we have used the Google Earth Engine (GEE) parallel computing platform [48]. Some of the computations, where GEE was less suitable, were performed using Google Compute Engine (GCE) running Ubuntu 15.04 (Vivid Vervet). The use of a dedicated machine was more suitable for the computation of hydrological parameters from the DEM, for which no algorithms are available yet in the GEE environment. JavaScript and Python were used as programming languages, as well as some open source tools and libraries (GDAL [41], Fiona [51], Shapely [52], ee-runner [53], and PCRaster [44]). We aimed at automating most of the steps to make sure that they would easily scale to a planetary scale, and that the results of the research can be reproduced when updated versions of the underlying datasets will be released.

Estimation of Positional Differences between Rivers
After the centerline is computed, positional differences can be easily computed using Goodchild's method of increasing overlay polygons.
The above method works best when the lengths of the segments and the distance between segments are significantly larger than the maximum buffer size D bmax used during dilation.

Positional Differences between OpenStreetMap, Landsat, and SRTM
Goodchild's method to estimate positional differences was applied for every line segment of the OSM dataset and two water mask raster datasets: (1) the drainage network estimated using HAND and (2) water centerline estimated using the Landsat 8 water mask. The final differences were estimated using Score = 0.85 as a threshold, which corresponds to a distance where 85% of the pixels from the second dataset (black line in Figure 14) are covered by a dilation via this distance. The frequency and cumulative histograms of the resulting differences are presented in Figure 15.
OSM dataset and two water mask raster datasets: (1) the drainage network estimated using HAND and (2) water centerline estimated using the Landsat 8 water mask. The final differences were estimated using Score = 0.85 as a threshold, which corresponds to a distance where 85% of the pixels from the second dataset (black line in Figure 14) are covered by a dilation via this distance. The frequency and cumulative histograms of the resulting differences are presented in Figure 15.  The second peak of the left histogram (distance > 100 m) occurs mainly due to wider water bodies, where the centerline cannot be reached within 20 thinning steps, but also for water bodies that start to overlap within the maximum size dilation, such as oxbows or meandering rivers located close to each other. Positional differences between Landsat centerlines and OSM become larger when a river width becomes larger. These large differences are observed because OSM polylines frequently represent the thalweg (the line of lowest elevation within a river) instead of a medial axis.
It can be seen that the distance between OSM water polylines and the centerlines computed using the Landsat 8 water mask falls in the range of 30-60 m for 60% of the water features. We were able to confirm this for about 17% (N = 5687) of the linear OSM water features. The length of the river segments was selected to be about 0.02 degree (~2.2 km).
In the case of SRTM, about 30% (N = 9887) of the OSM segments could be compared, mainly located in the southern, hilly part of the catchment (see Figure 16). The distances are slightly larger than those for Landsat and also appear normally distributed with a mean value of 110 m. The differences can be explained by a systematic shift between the OSM and SRTM datasets, combined with other factors, like the fact that smaller river meanders are not resolved well by SRTM dataset. The second peak of the left histogram (distance > 100 m) occurs mainly due to wider water bodies, where the centerline cannot be reached within 20 thinning steps, but also for water bodies that start to overlap within the maximum size dilation, such as oxbows or meandering rivers located close to each other. Positional differences between Landsat centerlines and OSM become larger when a river width becomes larger. These large differences are observed because OSM polylines frequently represent the thalweg (the line of lowest elevation within a river) instead of a medial axis.
It can be seen that the distance between OSM water polylines and the centerlines computed using the Landsat 8 water mask falls in the range of 30-60 m for 60% of the water features. We were able to confirm this for about 17% (N = 5687) of the linear OSM water features. The length of the river segments was selected to be about 0.02 degree (~2.2 km).
In the case of SRTM, about 30% (N = 9887) of the OSM segments could be compared, mainly located in the southern, hilly part of the catchment (see Figure 16). The distances are slightly larger than those for Landsat and also appear normally distributed with a mean value of 110 m. The differences can be explained by a systematic shift between the OSM and SRTM datasets, combined with other factors, like the fact that smaller river meanders are not resolved well by SRTM dataset.
using the Landsat 8 water mask falls in the range of 30-60 m for 60% of the water features. We were able to confirm this for about 17% (N = 5687) of the linear OSM water features. The length of the river segments was selected to be about 0.02 degree (~2.2 km).
In the case of SRTM, about 30% (N = 9887) of the OSM segments could be compared, mainly located in the southern, hilly part of the catchment (see Figure 16). The distances are slightly larger than those for Landsat and also appear normally distributed with a mean value of 110 m. The differences can be explained by a systematic shift between the OSM and SRTM datasets, combined with other factors, like the fact that smaller river meanders are not resolved well by SRTM dataset.  Spatial representation of the distances between OSM river segments and centerlines of the Landsat water mask, as well as the distances between OSM river segments and drainage network cells, are shown in Figure 16. It can be clearly seen that, in the case of Landsat, mainly large rivers could be compared due to the 30 m resolution limitation of the OLI sensor. At the same time, a much larger number of the OSM segments could be confirmed using the drainage network derived from 30 m SRTM, even though positional differences are not as good as in the case of Landsat. On the other hand, the drainage network derived from SRTM seems to be very inaccurate for flatter landscapes. Therefore, we have excluded drainage network pixels for the cases where HAND values are smaller than 30 m in the 300 m radius neighborhood. These parameters were identified empirically after careful visual inspection.

Goodness of Fit between OpenStreetMap and Landsat Water Masks
To estimate the overall overlap between water masks extracted from both OSM and Landsat, we have aggregated the overlap using a regular grid for a better understanding of the results. The analysis was performed using the actual water mask (polygonal and linear features of the OSM). For every grid cell, a total surface area of water mask was computed. For linear features, a dilation buffer of 15 m was applied using a square kernel to make sure it matches, at least, one Landsat grid cell. Then, the resulting thematic differences were computed ( Figure 17).
While the centerline analysis reveals a good fit between Landsat and OSM, the surface area analysis demonstrates quite a large mismatch. There can be several explanations for this mismatch. Firstly, OSM (as well as Google Maps) misses a large number of small agricultural reservoirs in this area. Secondly, many of the large reservoirs are intermittent and were mostly dry during 2013-2015. Thirdly, riverbank information frequently does not match between two datasets; in some cases it is missing in OSM, in other instances Landsat misses small rivers (W < 30 m) or the water bodies are partially covered by vegetation.
The surface water area of both OSM and Landsat constitutes about 0.85% of the total catchment area. Only about one-third (32%) of a total surface water area can be observed using both datasets. The rest of the surface water can be seen using only Landsat or only OSM (Table 2). careful visual inspection.

Goodness of Fit between OpenStreetMap and Landsat Water Masks
To estimate the overall overlap between water masks extracted from both OSM and Landsat, we have aggregated the overlap using a regular grid for a better understanding of the results. The analysis was performed using the actual water mask (polygonal and linear features of the OSM). For every grid cell, a total surface area of water mask was computed. For linear features, a dilation buffer of 15 m was applied using a square kernel to make sure it matches, at least, one Landsat grid cell. Then, the resulting thematic differences were computed ( Figure 17).   The results of the research, including the newly derived permanent water mask, hydrological datasets, source code required to reproduce the analysis and a supporting website, can be found in supplementary materials accompanying the paper (Table A1, Figure A2).

Discussion
The research demonstrates clear benefits of the use of the new imagery acquired by Landsat 8 to detect water bodies when combined with water masks derived from other sources. We have also found the SRTM to be an excellent complementary dataset enabling improvement of the water mask detection method for hilly areas, after its transformation into HAND.
However, none of the three water masks were found to be perfect regarding positional differences and completeness. The main issue of the water masks derived from Landsat 8 is its shortcomings in detecting small water features, such as small rivers or man-made canals and detecting water bodies (partially) covered by riparian or surface water vegetation. The main limitation with the drainage network derived from SRTM is its inability to detect river features for flat terrain conditions. The latter constituted a major part of our study area. However, this might not be the case when applied elsewhere. An additional challenge is related to the presence of high-frequency noise and a relatively poor quality of SRTM near water bodies. The noise can be explained by the radar origin of the dataset.
One of the next logical steps of the present research could be the development of a data fusion algorithm using benefits from all three datasets. Such development would require the introduction of objective criteria regarding confidence of every water mask depending on topographic and other conditions. Another step might be to perform the same analysis globally. However, performing global analysis would require significantly larger computational efforts and includes both detection of the water mask and estimation of HAND at 30 m resolution. Additional validation of the OSM and its fusion with the datasets produced by the local governmental agencies (Surface Hydrology, Water Observations from Space, local high-resolution elevation models) will help in harmonizing existing vector and raster water mask datasets.
Possible improvements to the method of water detection might include utilization of the panchromatic band and entropy-based methods, in addition to the spectral methods. Additional significant improvements can be achieved through the use of the other medium or high-resolution satellite missions such as Sentinel 2, PlanetLabs, and SkyBox. The use of higher resolution imagery would allow detection of much smaller (width < 30 m) river features, resulting in improved coverage.
The method of water detection can be easily extended to use Landsat 7 or any other multi-spectral imagery, for example, to generate inter-annual water masks or to study water dynamics.
The proposed method of water detection might face difficulties in the areas where an insufficient number of cloud-free observations are available, for example, in very wet or cold climates. In this case, it might be difficult to determine a correct range of the cloud-free percentiles to be used for water detection.

Conclusions
The new method allows detection of a high-resolution water mask using optical multispectral satellite imagery. The resulting water mask, together with the newly-generated 30 m-resolution drainage network derived from SRTM, was compared to the water mask extracted from OSM.
Our results reveal that the best surface water mask should be generated by combining all three datasets that were analyzed in the current study; water masks extracted from OSM, optical satellite imagery and the drainage network derived from high-resolution digital elevation models for hilly areas.
We found a good agreement, concerning positional accuracy, between river water features from OSM and water masks derived from Landsat 8. However, only 32% of the total OSM and Landsat 8 water mask matches when analyzing the actual intersecting surface area.
The newly generated Landsat 8 water mask reveals many new water bodies previously not present in OSM or any other vector dataset we have explored. A large part constitutes a large number of small agricultural reservoirs located in the northern and southern parts of the catchment.
All scripts used in the research, including the training set used for water mask refinement in hilly areas, can be found in the following GitHub repository [54]. The repository contains all scripts which allow one to: (1) extract water features from an OpenStreetMap planet file; (2) download SRTM DEM files clipped by HydroBASIN catchments and perform calculation of HAND using the Python version of PCRaster tools; (3) Google Earth Engine JavaScript scripts used to generate regular grids for processing, detect water masks for a given grid tile, as well as scripts required to skeletonize that water mask and compute distances between OSM segments and another river raster dataset (drainage network derived from HAND or Landsat water mask centerline). Additional supplementary scripts are available in a form of Jupyter Notebooks. These scripts were used to clean up HydroBASINS catchments, generate tiled version of HAND from basin-based images, split OpenStreetMap water geometries into smaller segments to perform local spatial analysis, as well as scripts used to generate histograms from GeoJSON files produced by water detection scripts.

A1. Grids Used During Analysis
The following two grids were used during analysis: (1) HydroBASINS level 8 catchment polygons and (2) regular~20ˆ20 km grids ( Figure A1). The first one was used to parallelize the generation of hydrological parameters. The latter one was used to perform water detection and to visualize the results in an aggregated form, and also to parallelize the water detection analysis using Landsat imagery.

A2. Results as Raster and Vector Datasets
Some of the datasets used during processing were uploaded and shared in a form of Google Fusion Tables and Google Earth Engine Raster Assets (Table A1).

A2. Results as Raster and Vector Datasets
Some of the datasets used during processing were uploaded and shared in a form of Google Fusion Tables and Google Earth Engine Raster Assets (Table A1).

A3. Website
Most of the datasets produced by this research can be accessed using a website [57] dedicated to this study ( Figure A2). The website is hosted in a Google Cloud and uses Google App Engine infrastructure, providing integration with Google Earth Engine where most of the datasets produced in this study can be found.