Synergistic Use of Geospatial Data for Water Body Extraction from Sentinel-1 Images for Operational Flood Monitoring across Southeast Asia Using Deep Neural Networks

Deep learning is a promising method for image classification, including satellite images acquired by various sensors. However, the synergistic use of geospatial data for water body extraction from Sentinel-1 data using deep learning, and the applicability of existing deep learning models, have not been thoroughly tested for operational flood monitoring. Here, we present a novel water body extraction model based on a deep neural network that exploits Sentinel-1 data and flood-related geospatial datasets. For the model, the U-Net was customised and optimised to utilise Sentinel-1 data and other flood-related geospatial data, including digital elevation model (DEM), Slope, Aspect, Profile Curvature (PC), Topographic Wetness Index (TWI), Terrain Ruggedness Index (TRI), and Buffer layers for the Southeast Asia region. The water body extraction model was tested and validated on three Sentinel-1 images covering Vietnam, Myanmar, and Bangladesh. By segmenting 384 Sentinel-1 images, model performance and segmentation accuracy were evaluated for all 128 cases determined by the combinations of stacked layers, according to the types of combined input layers. Of the 128 cases, 31 showed improvement in Overall Accuracy (OA), and 19 showed improvement in both averaged intersection over union (IOU) and F1 score for the three Sentinel-1 images segmented for water body extraction. The averaged OA, IOU, and F1 scores of the 'Sentinel-1 VV' band are 95.77, 80.35, and 88.85, respectively, whereas those of the 'band combination VV, Slope, PC, and TRI' are 96.73, 85.42, and 92.08, showing the improvement gained by exploiting geospatial data. This improvement was further verified with water body extraction results for the Chindwin river basin, where quantitative analysis of the 'band combination VV, Slope, PC, and TRI' showed an improvement in F1 score of 7.68 percent compared to the segmentation output of the 'Sentinel-1 VV' band.
Through this research, it was demonstrated that the accuracy of deep learning-based water body extraction from Sentinel-1 images can be improved by up to 7.68 percent by employing geospatial data. To the best of our knowledge, this is the first work that demonstrates the synergistic use of geospatial data in deep learning-based water body extraction over wide areas. It is anticipated that the results of this research could be a valuable reference when deep neural networks are applied to satellite image segmentation for operational flood monitoring and when geospatial layers are employed to improve the accuracy of deep learning-based image segmentation.


Introduction
Floods, which account for 52.1% of natural disasters by frequency, occur unexpectedly and cause devastating damage over broad areas [1][2][3]. It was reported that hydrological disasters, including floods, were responsible for 19.3% of the total damage caused by natural disasters and 20.4% of the total number of victims [4]. Thus, flood monitoring, including flooded area extraction and estimation, is critical for responding to, and recovering from, such damage. Satellite remote sensing techniques have been used to estimate flooded areas, as they can provide visual information over wide areas [5][6][7]. Yet timely monitoring and estimation of inundated areas in flood situations have been limited by satellite data acquisition and by the analysis of such data, including the accuracy of classification for extracting flooded areas from available satellite data. Poor classification accuracy can lead to greater flood damage, as such damage depends heavily on the quality of flood forecasting, flood area estimation, and settlement patterns [8].
As the acquisition of optical satellite data is limited mainly by natural constraints, e.g., weather conditions and cloud cover during the rainy season [9], spaceborne Synthetic Aperture Radar (SAR) data have been considered suitable for flood monitoring [10][11][12]. Such data are almost independent of cloud cover, sunlight, and other weather conditions. For SAR image classification, backscatter intensity, polarimetric parameters, and interferometric coherence information have mainly been exploited [13]. For water body extraction using a single image, SAR data have been analysed with supervised or unsupervised classification methods that include thresholding, distance-based classification, decision tree/rule-based classification, image clustering approaches, and machine learning techniques [5,12,13,14]. Yet, threshold values and classification rules determined for a certain region have proven difficult to apply to other SAR images or regions [15], and accurately extracting flooded areas from SAR images is constrained by objects in the images that have similar reflectance values, such as roads, airports, mountainous areas, and radar shadow [16][17][18].
It was reported that combining multiple layers that provide more information on targeted areas may allow for the discrimination of objects with similar backscattering values [19], and the accuracy of extracting water bodies from satellite data can thus be improved by the combined use of remote sensing data and other ancillary data, such as digital elevation model (DEM) products and digital topographic maps [20][21][22]. Flooding potential is determined by various conditions of river basins, including the characteristics of the climatic system and drainage basin conditions [23,24]. For predicting flood-prone areas by analysing spatial data, remotely sensed satellite data have been used in combination with geospatial data, such as DEM, Slope, and Aspect [2]. Yet, when analysing satellite data for water body extraction for flood monitoring, such factors have not been fully reflected in the process as a form of ancillary data. This means that the effects of using geospatial layers for water body extraction remain uncertain, and further research is thus needed.
In image classification and segmentation, previous research showed that deep learning models outperform the aforementioned traditional classification methods [25][26][27][28]. Deep learning methods such as convolutional neural networks (CNNs) have been widely applied for land cover classification, road extraction, ship detection, and other domains. Yet, even advanced deep learning methods have difficulties in discriminating water bodies in SAR images, due mainly to the misclassification of objects with similar backscattering values. It could be assumed that the backscattering values of SAR data, which may contain insufficient information for clearly discriminating water bodies in images, could be supported by other data. Yet, only a very limited number of studies for this purpose have been conducted using SAR data [17]. For deep learning-based flood monitoring models, although it was reported that geospatial datasets could be used for the spatial prediction of floods using machine learning approaches [29], the actual influence of such datasets on model performance is still poorly understood. To take account of the information in such geospatial datasets, existing deep learning models need to be optimised for water body extraction from satellite data. Yet, existing research has focused mainly on producing more training data or on advancing network architectures to improve classification accuracy. In addition, existing deep learning models have not been thoroughly tested and optimised for operational flood monitoring, as most results are confined to analysing specific bands of available satellite images or specific research sites [30]. Considering the existing literature, the synergistic use of geospatial data in deep learning-based water body extraction over wide areas has yet to be demonstrated. To the best of our knowledge, this is the first research that demonstrates such synergistic use over wide areas. The aim of this research is to present a new deep learning-based flood monitoring model with better predictive ability, obtained by testing the effectiveness of combining geospatial layers. To that end, we conducted intensive and comprehensive experiments examining the synergistic use of geospatial data for water body extraction from Sentinel-1 data using deep learning, and demonstrated that the accuracy of water body extraction from Sentinel-1 data is improved by utilising such data. In the process, we also constructed a geospatial database that contains structured and unified Shuttle Radar Topography Mission (SRTM) DEM, Slope, Aspect, Profile Curvature (PC), Topographic Wetness Index (TWI), Terrain Ruggedness Index (TRI), and Buffer layers for the Southeast Asia region, and present a novel flood area monitoring model based on deep learning, which is automated and optimised to the region for operational purposes.
This paper consists of six sections. Detailed explanations of the input data production and the methods developed for this research are presented in Sections 2 and 3, and the experimental results of evaluating the effectiveness of using geospatial layers for water body extraction are presented in Section 4. The results are discussed in the discussion section, in addition to the relationship with other research, wider implications of the research, and its limitations, before concluding remarks are presented in Section 6. As almost all the countries in Southeast Asia suffer from floods during the rainy season, the Southeast Asia region was selected as the research area. Due to limited financial resources, infrastructure, and technological means to respond to floods, the impact of floods on countries in the region tends to be more severe than for other countries [2].

Producing Input Data and Geospatial Database
Sentinel-1 data and United Nations Satellite Centre (UNOSAT) flood datasets for the region were used as the main input data (Figure 1). UNOSAT provides analysed flood boundaries and flood extent in the shapefile spatial data format during and after flood events [31,32]. The UNOSAT flood datasets were produced with a thresholding method and validated through manual visual inspection and modification; they are freely available through the UNOSAT flood portal (http://floods.unosat.org/geoportal/catalog/search/search.page, accessed on 18 September 2020) for various purposes, including flood model calibration and supporting post-disaster field assessments. Based on the locations and dates of the flood data, corresponding Sentinel-1 images were obtained from the Copernicus Open Access Hub (https://scihub.copernicus.eu/dhus/#/home, accessed on 21 September 2020).
Since empirical experiments showed that classification accuracy was not significantly improved by using other SAR information, such as VH polarisation and incident angle, as additional bands, and that VV polarisation showed higher accuracy than VH polarisation in water body classification, the VV band was selected as the main input satellite data [33]. To use the data for extracting water bodies from multiple Level-1 Ground Range Detected (GRD) Sentinel-1 images, which were acquired in interferometric wide (IW) mode at a 20 m × 5 m spatial resolution, the digital numbers of the images were converted into Sigma0 by performing radiometric calibration [17]. The pre-processing procedures applied to the Sentinel-1 images include 'Remove GRD border noise', 'Radiometric Calibration' (VV), 'Speckle Filtering', and 'Terrain Correction'. The output SAR data pre-processed for further analyses therefore have amplitude values in linear scale, with a pixel spacing of 10 m × 10 m. A total of 50 scenes of Sentinel-1 images acquired between 2015 and 2018, corresponding to the dates and locations of the vector data of the UNOSAT flood datasets (Figure 1d), were downloaded and pre-processed for visual inspection. After completing the process, 30 scenes were finally used as input data to train and validate deep learning models. In accordance with geographical features and the locations of ground truth data, three areas were selected for accuracy assessment, i.e., the Padma river basin (A) in Bangladesh, the Lower Mekong river basin (B) in Vietnam, and the Chindwin river basin (C) in Myanmar, which are reported to often experience flash floods during rainy seasons and have different topographical features. For reliable evaluation of model performance and segmentation results, three of the 30 Sentinel-1 scenes, covering the three regions, were used for inference and validation (No. 1–3 in Table 1). Although the original UNOSAT flood extent dataset has been verified by intensive data cleaning and manual
visual inspection, some mismatches between the satellite data and the flood extent vector data were found during the labelling processes. Among the initial 30 input data pairs, only 12 input data pairs were selected by intensive visual interpretation. Data pairs were excluded from the input data in the following cases: (1) the extents of an SAR image and its label data did not match; (2) the quality of the label data was low, for instance, where water and non-water at flat beaches were mislabelled. We modified and included some data pairs in the input data for the following cases: (1) the label was accurate but the ocean was not labelled as water; in this case, we relabelled the ocean as water, based on the idea of sea masking; (2) label shapefiles existed for only some parts of the SAR images; in such cases, the extent of the SAR images was wider than that of the label data, so the stacked SAR images were cropped to the extent of the label shapefiles. Lastly, the label shapefiles needed to be rasterized at the same spatial resolution as the VV images. We designed a code that created an empty grid with the same extent and spatial resolution as the Sentinel-1 VV band images, and executed it so that pixels overlapping the shapefiles were assigned meaningful values. The corresponding flood extent boundaries in the shapefile format were thereby converted into binary raster data of 0 (non-water) and 1 (water), which are used as label data for training and ground truth data for validation.

Building a Geospatial Database
Geospatial data have been used to predict floods, manage flood emergencies, and produce flood-related maps, including flood risk and susceptibility maps [29,34]. Relying on the wealth of literature in hydrology and remote sensing sciences, geospatial layers with the potential to provide topo-hydrographic information were selected and produced for deep learning-based water body extraction [35][36][37][38]; these had been evaluated as related to the features of flooded areas [39] and as geographical variables that influence river flood occurrence [19,23]. The layers include Digital Elevation Model (DEM), Slope, Aspect, Terrain Ruggedness Index (TRI), Profile Curvature (PC), Topographic Wetness Index (TWI), and 'distance from water bodies' (i.e., Buffer), which were expected to improve the discrimination of land surface and terrain effects caused by various physical characteristics (Figure 2). To produce the geospatial layers as additional input layers, 1-ArcSecond Global Shuttle Radar Topography Mission (SRTM) DEM tiles, which are freely available through the USGS webpage (https://earthexplorer.usgs.gov/, accessed on 22 September 2020), were downloaded, mosaicked, gap-filled, and exported in the EPSG:4326 (WGS 84 latitude/longitude) coordinate system at 1-arcsecond (around 30 m) spatial resolution. Using the mosaicked DEM layer, the Slope, Aspect, and PC layers were produced at the same spatial resolution and coordinates.
In addition, TRI, which indicates terrain heterogeneity, and TWI layers were also generated based on [40][41][42][43], respectively. To produce a Buffer layer, which shows the proximity to rivers, digital topographic maps for the research area were merged, and the water layers were then extracted based on the attributes of the vector data. Before stacking, the Buffer layer required an additional processing step, as the produced buffer rings around river mouths protruded into coastlines. We masked out the sea to prevent errors around river mouths. The size of each layer, produced for the extent of the whole Southeast Asia region, is around 50 GB, and the size of all the layers saved in the geospatial database is thus around 400 GB, excluding satellite images and label data.
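As an illustration of how such terrain layers can be derived from the DEM, the TRI computation can be sketched as follows. This is a minimal NumPy sketch using the common Riley et al. definition, i.e., the square root of the summed squared elevation differences between a cell and its eight neighbours; the exact formulation used for the database follows the cited references [40][41][42][43].

```python
import numpy as np

def tri(dem):
    """Terrain Ruggedness Index: for each interior cell, the square root of
    the summed squared elevation differences to its 8 neighbours
    (edge cells are left as 0 in this minimal sketch)."""
    out = np.zeros(dem.shape, dtype=float)
    for r in range(1, dem.shape[0] - 1):
        for c in range(1, dem.shape[1] - 1):
            window = dem[r - 1:r + 2, c - 1:c + 2]
            out[r, c] = np.sqrt(np.sum((window - dem[r, c]) ** 2))
    return out
```

A flat DEM yields a TRI of zero everywhere, while isolated peaks and rough terrain yield high values, which is why the layer helps separate mountainous radar-shadow areas from water.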

Deep Learning-Based Water Body Extraction Model for Operational Flood Monitoring
For this research, a deep learning-based water body extraction model for operational flood monitoring across the Southeast Asia region was developed, as shown in Figure 3. The model consists of four steps: (a) producing input data and pre-processing, (b) stacking and matching input data, (c) semantic image segmentation, and (d) accuracy assessment (Figure 3). The first step is explained above, and detailed methods for steps two, three, and four are explained in the following sections. Using the model, all 128 cases, i.e., the possible combinations of satellite data and geospatial layers, were examined to evaluate the effectiveness of adding ancillary layers and to assess model performance and segmentation accuracy.
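The 128 cases follow directly from the fact that each of the seven geospatial layers can either be stacked on the VV band or left out (2^7 = 128). A short sketch of how the cases can be enumerated (the band names mirror those used in this research; the enumeration code itself is illustrative):

```python
from itertools import combinations

# The seven candidate geospatial layers stacked on top of the Sentinel-1 VV band.
GEO_BANDS = ('DEM', 'Slope', 'Aspect', 'PC', 'TWI', 'Buffer', 'TRI')

def band_combinations():
    """Enumerate every tested case: VV alone plus VV with each
    subset of the seven geospatial layers (2**7 = 128 cases)."""
    return [('VV',) + combo
            for r in range(len(GEO_BANDS) + 1)
            for combo in combinations(GEO_BANDS, r)]
```

The list includes, for example, `('VV',)` as the baseline case and `('VV', 'Slope', 'PC', 'TRI')` as 'band combination 1358'.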

Customisation and Optimisation of the Deep Neural Network
For operational flood monitoring by extracting flood damage information from satellite data, Convolutional Neural Network (CNN)-based deep learning methods may not be effective if the spatial resolution of the input satellite data is lost in the process of downsampling with pooling layers. For this reason, fine-grained, pixel-wise inference is required, and by replacing fully connected layers with convolutional ones, semantic segmentation based on fully convolutional networks (FCNs) was able to achieve pixel-wise labelling [45]. Yet, FCNs have limitations in achieving highly accurate pixel-wise class labelling, due to difficulties in reconstructing the non-linear structures of object boundaries [28,46].
As the main purpose of this research is to present a reliable deep learning-based flood monitoring model with better predictive ability by testing the effectiveness of geospatial layers, the U-Net architecture was customised and optimised to utilise Sentinel-1 data and seven different types of geospatial layers. The U-Net was developed for semantic segmentation with a relatively small amount of training data [44,47] and has thus been widely used to classify urban features with shorter training times and minimal loss of spatial resolution [48]. Unlike the original U-Net, our model can take geo-located, multi-layered SAR images and other ancillary data in the GeoTiff format as input, and the optimised deep network presented here does not lose any spatial resolution or location information of the multi-modal input data. In addition, the size of the input data for inference is significantly larger than in the original U-Net, to reduce processing times and to achieve seamless merging of segmented image patches.
The architecture of the model for semantic image segmentation consists of 18 3 × 3 2D convolution layers, which use the Rectified Linear Unit (ReLU) as the activation function, and one 1 × 1 2D convolution layer, which uses Sigmoid as the activation function. The convolution layers in the contracting and expanding paths are interleaved with four 2 × 2 2D max-pooling layers and four 2 × 2 up-sampling layers, which perform nearest-neighbour interpolation. The contracting and expanding paths are combined by concatenation, and padding is added to the convolution layers to preserve the spatial resolution of the input data. The total number of trainable parameters of the model is 31,379,521. The architecture of the model is presented in Figure 3c, and the hyper-parameters for training are explained in detail in Section 3.1.3.
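The stated parameter count can be reproduced from the architecture description alone, since max-pooling, nearest-neighbour upsampling, and concatenation add no trainable weights. The sketch below assumes, rather than the text stating, the standard U-Net filter progression of 64 to 1024 channels and a four-band input (e.g. VV, Slope, PC, and TRI); under those assumptions the count matches 31,379,521 exactly.

```python
def conv_params(k, c_in, c_out):
    """Weights plus biases of a single k x k 2D convolution layer."""
    return k * k * c_in * c_out + c_out

def unet_param_count(in_channels=4, base=64):
    """Trainable parameters of the described U-Net variant: 18 3x3 convs
    (two per level) plus one 1x1 output conv with sigmoid."""
    widths = [base * 2 ** i for i in range(5)]      # 64, 128, 256, 512, 1024
    total, c = 0, in_channels
    # Contracting path (max-pooling layers have no weights).
    for w in widths:
        total += conv_params(3, c, w) + conv_params(3, w, w)
        c = w
    # Expanding path: nearest-neighbour upsampling (no weights), then two
    # 3x3 convs on the concatenation of upsampled and skip features.
    for w in reversed(widths[:-1]):
        total += conv_params(3, c + w, w) + conv_params(3, w, w)
        c = w
    # Final 1x1 convolution producing the binary water mask.
    return total + conv_params(1, c, 1)
```

With `in_channels=4` this returns 31,379,521; with a single VV band the count differs only in the first convolution.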

Stacking Input Data for Matching Layers and Normalisation
The last procedure to generate georeferenced input data for model training was stacking the eight separate datasets into one single file with eight layers and normalising the layer values. Since the geospatial layers were georeferenced images with various pixel sizes, geographic extents, datatypes, and data formats, and deep learning models are trained on the information in the pixels, the main point of stacking the input data was matching these factors. The geospatial layers were georeferenced to the WGS 84 latitude/longitude coordinate system but had different pixel spacings and extents. It was therefore necessary to combine the layers into a multi-band single image, with the first layer determining the output size and extent of the layer stack as a reference layer. To achieve this, we clipped the input layers to the extent of the input SAR data and stacked them using the Geospatial Data Abstraction Library (GDAL). We first defined a common grid from the input SAR data, to which the auxiliary layers were resampled and reprojected. Secondly, each auxiliary layer was clipped to this common extent. Following that, we resampled the geospatial layers with different spatial resolutions and extents to the target resolution and extent (here, those of the input SAR data). To clip accurately, without the influence of raster properties, a new Python script was devised using a shapefile made from the VV raster pixel coordinates. As the SAR images were preprocessed at 10 m pixel spacing, the other bands were also interpolated to the same pixel size. Finally, the resampled datasets were stacked into a single dataset with eight separate bands using the gdal_merge algorithm. Through this procedure, the final input data, consisting of eight raster layers with the same pixel size and coordinate system, were produced for training (Figure 4a). To train deep neural networks for semantic segmentation, all of the satellite images and geospatial data
were normalised and standardised. To obtain more accurate models and reduce processing time, we rescaled all of the input layers to values between 0 and 1 by considering the standard deviations and histograms of the layer values. The normalisation scheme was chosen based on more than 300 experiments conducted to evaluate the effects of normalisation. To remove speckle values, values of the VV layer outside the 0 to 1 range were clipped to 0 or 1. The values of Slope and Aspect have a defined range, so we divided them by the theoretical maximum value to map them to a range of 0 to 1. Therefore, the VV, Slope, and Aspect layers contain continuous values between 0 and 1. Based on the real value ranges of the whole Southeast Asia region, the other bands were reclassified into discrete values between 0 and 1 (Table 2) after evaluating the effects of employing discrete values. The stacked dataset, matched and normalised, is exported to a database in GeoTiff format to be analysed with the deep learning algorithm as described in the following section.
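The continuous-valued layers can be normalised as described with a few lines of array arithmetic. This is a sketch; taking 90° as the theoretical maximum for Slope and 360° for Aspect is our reading of 'theoretical maximum value', and the VV amplitudes are assumed to have been pre-scaled so that valid pixels fall in [0, 1].

```python
import numpy as np

def normalise_layers(vv, slope, aspect):
    """Map the continuous input layers to the 0-1 range:
    clip stray VV speckle values outside [0, 1], and divide Slope and
    Aspect by their theoretical maxima (90 and 360 degrees)."""
    return np.clip(vv, 0.0, 1.0), slope / 90.0, aspect / 360.0
```

The remaining layers (DEM, PC, TWI, TRI, Buffer) are instead reclassified into discrete 0-1 values based on the value ranges observed over the whole region (Table 2).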

Model Training
Deep learning models were trained from scratch using the stacked datasets produced for this research. Before starting model training, band extraction was performed. Combinations of geospatial bands were automatically selected when the number of elements was set as the input for systematic evaluation. When a combination to test was chosen, the stacked input data were copied with only the VV band and the selected geospatial bands (Figure 4b). The copied, stacked images were cropped into 320 × 320 pixel patches (Figure 4c) and saved only if the border of the VV image was not included. In addition, the rasterized label data were cropped into 320 × 320 pixel patches and saved only if the proportion of water pixels was between 10% and 90% (Figure 5). We matched cropped stacked images to the corresponding cropped label images and filtered them based on these two conditions. The final number of pairs of SAR images and corresponding label images was 4326. The customised and optimised U-Net for water body extraction was trained for all of the possible combinations of the Sentinel-1 images and geospatial layers. For deep learning, hyper-parameters need to be tuned, often through heuristics, which requires repetitive empirical tests. Repetitive systematic experiments were performed to decide the optimal hyper-parameters for the model with minimum loss. The selected hyper-parameters are as follows (Table 3): the activation function used for each layer is ReLU, and the activation function for the output layer is Sigmoid. The kernel size of the convolution layers is 3 × 3; 2 × 2 kernels are used for the upsampling and max-pooling layers; and a 1 × 1 kernel is used for the output convolution layer. To maintain the size of the input layer, the stride is fixed to 1 × 1 and 'same' padding is used. Adadelta is used as the optimiser, with an initial learning rate of 1 and a decay rate of 0.95. The patch size is 320 × 320 pixels with 1–8 channels, and the 4326 input pairs have water body rates between 0.1 and 0.9. Among the 4326 input pairs, we randomly split them into
three sets: training, validation, and test datasets [49]. The selected dataset split ratio is 60%, 20%, and 20%. The minibatch size for training is 16 patches, and training iterates over the whole training dataset up to 170 times. To prevent overfitting, and to minimise training time, an early stopping function was adopted in the training process. When the validation loss does not improve for five consecutive epochs, the weights of the model with the minimum validation loss are automatically saved for the segmentation of new Sentinel-1 images.
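The early stopping rule described above can be sketched as a small helper: stop once five consecutive epochs pass without the validation loss improving, while remembering the weights of the best epoch. This is illustrative; in practice the equivalent built-in callback of the deep learning framework would be used.

```python
class EarlyStopping:
    """Stop training once the validation loss has not improved for
    `patience` consecutive epochs, remembering the best weights."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float('inf')
        self.wait = 0
        self.best_weights = None

    def update(self, val_loss, weights):
        """Record one epoch; returns True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss, self.best_weights, self.wait = val_loss, weights, 0
            return False
        self.wait += 1
        return self.wait >= self.patience
```

When `update` returns True, `best_weights` holds the model state with the minimum validation loss, which is what is saved for segmenting new Sentinel-1 images.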

Inference
For prediction, the same preprocessing and band extraction procedures were applied to the inference data. As mentioned in the model training section, combinations of geospatial bands were automatically selected and copied into a separate folder. The copied images were cropped into patches and reclassified based on the same criteria as the training input data. The patches were predicted into binary outputs using the trained models. The output values are 1 and 2, which are defined as non-water and water, respectively.
Meanwhile, VV images without the other geospatial bands were copied into a result folder, onto which the predicted output was overlaid. The purpose of this procedure is to georeference the output and to remove the borders of the VV images. As SAR images usually have an inclined rectangular shape, there are margins that contain no information. The cropped patches had no geographic coordinates but were predicted in the same way whether or not they were part of the border. The cropped outputs were combined in order and overlaid on the copied VV images only where the pixels in the patches were not part of the margin. As a result, the outputs carried coordinate information and contained predictions only over meaningful data.
A new code was developed to improve the inference procedure, which involved modifying: (1) the cropping size for inference data and (2) the padding size for combining cropped patches. First, the cropping size for inference data differs from that for training data. The size of the training patches is 320 × 320 pixels, but that of the inference patches is 3040 × 3040 pixels. The reason for the different patch sizes is that increasing the patch size reduces inference time. More importantly, an essential precondition for increasing the patch size was that the quality of inference be maintained. We tested various cropping sizes, from smaller to larger than the training size, and verified through visual interpretation and numerical indicators, such as accuracy, precision, IOU, recall, and F1 score, that increasing the size does not degrade the quality of the outputs. Second, for mosaicking the predicted patches, overlapping of patches was not used. Without padding, inference time became shorter and duplication errors on borders were completely removed.
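Because the inference patches are predicted without overlap or padding, mosaicking reduces to concatenating the patches back in row-major order. A minimal NumPy sketch (the margin masking against the copied VV image is omitted here):

```python
import numpy as np

def mosaic_patches(patches, n_rows, n_cols):
    """Reassemble non-overlapping prediction patches (row-major order)
    into one array; without padding there are no duplicated border pixels
    to blend, so plain concatenation suffices."""
    rows = [np.concatenate(patches[r * n_cols:(r + 1) * n_cols], axis=1)
            for r in range(n_rows)]
    return np.concatenate(rows, axis=0)
```

Dropping the overlap entirely is what removes the duplication errors on patch borders and shortens the inference time.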
All experiments for testing the effects of geospatial data on image segmentation, and for training and validating the deep neural network, were conducted on a GPU server with four Nvidia GeForce RTX 3090 GPUs, each with 24 GB of memory, and 260 GB of RAM. The server also has 72 Intel(R) Xeon(R) Gold 6240 CPU @ 2.60 GHz cores, one SSD, and one 11 TB HDD. The versions of the NVIDIA driver, CUDA, and Python are 470.57.02, 11.4, and 3.9.4, respectively.

Accuracy Assessment
The performance of deep learning architectures can be evaluated with criteria such as the Overall Accuracy (OA) of pixel-wise classification, time, and memory usage [28]. For evaluating image classification accuracy, OA, the proportion of correct predictions among the total number of predictions, has been commonly used. Although using a confusion matrix to evaluate classification accuracy is common in supervised learning, accuracy metrics can mislead if the class representation in the evaluation data is unbalanced [28]. Therefore, in addition to OA, precision, recall, mean intersection over union (IOU), and F1 score were selected for a more precise evaluation of the model and the inferred output [2,48]. Precision is the proportion of correct water pixels among the predicted water pixels, while recall is the proportion of correctly predicted water pixels among the actual water pixels. Mean IOU indicates the degree to which the predicted water regions overlap the ground truth regions. The F1 score is the harmonic mean of precision and recall (see [50]). The confusion table used in this research and the mathematical formulas for the five criteria are shown in Table 4.
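The five criteria can be computed directly from the binary confusion counts, with water treated as the positive class (a sketch mirroring the formulas in Table 4):

```python
def segmentation_metrics(tp, fp, fn, tn):
    """OA, precision, recall, IOU and F1 from binary confusion counts
    (tp/fp/fn/tn), with water as the positive class."""
    oa = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # correct water among predicted water
    recall = tp / (tp + fn)             # correct water among actual water
    iou = tp / (tp + fp + fn)           # overlap of predicted and true water
    f1 = 2 * precision * recall / (precision + recall)
    return oa, precision, recall, iou, f1
```

Unlike OA, the IOU and F1 scores ignore the (often dominant) true-negative non-water pixels, which is why they are more informative when the classes are unbalanced.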

Formulas for Accuracy Assessment of Output Images
Trained models and segmentation accuracy were evaluated with these five criteria and the binary cross-entropy loss function. As mentioned above, the input data for training were divided into train:validation:test sets at a ratio of 6:2:2. The test data were used to calculate the values of the loss and the confusion matrix, showing how well the model is trained. The mathematical formulation of binary cross entropy is C = −(1/n) Σ [y ln(a) + (1 − y) ln(1 − a)], where C is the cross entropy, n is the total number of test data, y is the desired value, and a is the predicted value. Cross entropy was used instead of mean square error (MSE) because MSE is a more time-consuming loss function than cross entropy [51].
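The binary cross entropy, C = −(1/n) Σ [y ln(a) + (1 − y) ln(1 − a)], can be computed directly as follows (a sketch over per-pixel labels y and predicted probabilities a):

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """C = -(1/n) * sum(y * ln(a) + (1 - y) * ln(1 - a)) over the n test
    pixels, with binary labels y and predicted probabilities a in (0, 1)."""
    n = len(y_true)
    return -sum(y * math.log(a) + (1 - y) * math.log(1 - a)
                for y, a in zip(y_true, y_pred)) / n
```

For a confident, correct prediction the per-pixel term approaches 0, while confident wrong predictions are penalised heavily, driving the sigmoid outputs towards the 0/1 labels.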
For inference evaluation, we relied on the same principle, i.e., the confusion matrix, but used random sampling. As the three images for inference are composed of about 20,000 × 30,000 pixels, calculating the confusion matrix on a pixel-by-pixel basis between ground truth data and prediction results is time-consuming. As a shorter run time was required to efficiently compare the various band combinations and for operational flood monitoring, we adopted a random sampling method for the calculation. The sample size was calculated for each inference image, satisfying a 99% confidence level, an observed proportion of 0.5, and a margin of error of 0.01. From the population, we randomly selected pixels with Python code developed for this task and evaluated only the selected pixels. To avoid the effect of margin areas, the total number of calculated pixels was far larger than the sample size. We verified that the confusion matrix calculated from the random sample was statistically equivalent to the confusion matrix calculated over all pixels. In addition to the quantitative accuracy assessment, intensive visual inspection was performed to evaluate the quality of the output images at various scales [28].
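A sample size satisfying these conditions can be obtained with Cochran's formula, n = z²p(1 − p)/e²; the paper does not name the formula used, so this is an assumption, but it is the standard choice for an estimated proportion under a given confidence level and margin of error. For z = 2.576 (99% confidence), p = 0.5, and e = 0.01 it gives 16,590 pixels:

```python
import math

def sample_size(z=2.576, p=0.5, e=0.01):
    """Cochran's formula n = z^2 * p * (1 - p) / e^2, here with z = 2.576
    (99% confidence level), observed proportion p = 0.5 and a 0.01 margin
    of error, rounded up to the next whole pixel."""
    return math.ceil(z * z * p * (1 - p) / (e * e))
```

Since p = 0.5 maximises p(1 − p), this sample size is conservative, yet it remains negligible compared to the roughly 6 × 10⁸ pixels of each inference image.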

Segmentation Results and Improved Cases
All 128 cases determined by the combinations of stacked layers were evaluated through model training and the inference of the three Sentinel-1 images for image segmentation; the total number of Sentinel-1 images segmented for this research is therefore 384. Of the 128 cases tested, 31 showed improvement in Overall Accuracy (OA), and 19 showed improvement in both averaged IOU and F1 score for the three images segmented for water body extraction, as shown in Figure 6. Most of the cases consisting of six, seven, and eight layers showed lower OA, IOU, and F1 scores compared to those of the Sentinel-1 VV band image. In Figure 6, the numbers under the x-axes indicate band combinations consisting of 1-VV, 2-DEM, 3-Slope, 4-Aspect, 5-PC, 6-TWI, 7-Buffer, and 8-TRI, and detailed information on the stacked geospatial layers is given in Table 1 above.
The training accuracy for model performance and the averaged inference results of the three Sentinel-1 images for the 19 cases are shown in Table 5. For training, the Overall Accuracy (OA), IOU, and F1 score of the Sentinel-1 VV band image are 94.91, 87.83, and 93.52, respectively; for inference they are 95.77, 80.35, and 88.85, evaluated by comparing sampled pixels of the output to the corresponding ground truth data. Of the 19 cases, 'band combination 1358' (VV, Slope, PC, and TRI) and 'band combination 1357' (VV, Slope, PC, and Buffer) showed the best inference accuracy. The OA, IOU, and F1 score of 'band combination 1358' are 96.73, 85.42, and 92.08, and those of 'band combination 1357' are 96.89, 85.85, and 92.31, respectively. Compared to the Sentinel-1 VV band, 'band combination 1358' (VV, Slope, PC, and TRI) improved segmentation accuracy by 0.96, 5.07, and 3.23 in the three criteria.

Improvement in Inference Accuracy of the Three Cases
The F1 scores of the 19 band combinations for Scenes A (Padma river basin in Bangladesh), B (Lower Mekong river basin in Vietnam), and C (Chindwin river basin in Myanmar), and the differences in F1 score between those combinations and the Sentinel-1 VV band images, are presented in Table 6, together with the averaged OA, precision, recall, IOU, and F1 score for the 19 cases. The results show that stacking geospatial layers improves segmentation accuracy compared to the VV band images alone. For Scene A, minor improvements (up to 4.25) were observed, whereas a minor decrease (−2.46) in F1 score was observed for Scene B relative to its VV band result. 'Band combination 1358' (VV, Slope, PC, and TRI) for Scene C showed an improvement of 7.68 compared to the segmentation output of the Sentinel-1 VV band image for the same area. Of the 19 cases, 4 (band combinations 134 (VV, Slope, and Aspect), 1357 (VV, Slope, PC, and Buffer), 1358 (VV, Slope, PC, and TRI), and 1578 (VV, PC, Buffer, and TRI)) showed improvements in F1 score for all three scenes compared to the Sentinel-1 VV band images. The segmentation results of 'band combinations 1358 and 1357' for the three areas, which have different topographical features, are presented in Figure 7 for comparison. Scene C (Chindwin river basin in Myanmar), which contains mountainous areas, showed lower segmentation accuracy than Scenes A and B but the highest improvement in segmentation accuracy.
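The accuracy criteria reported throughout (OA, precision, recall, IOU, and F1; cf. Table 4) can all be derived from a binary confusion matrix. The sketch below is a generic illustration of those standard definitions, not the study's exact evaluation code:

```python
def pixel_metrics(tp, fp, fn, tn):
    """Pixel-wise accuracy criteria for binary water/non-water maps,
    computed from confusion-matrix counts (true/false positives/negatives)."""
    oa = (tp + tn) / (tp + fp + fn + tn)       # Overall Accuracy
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)                  # intersection over union
    return {"OA": oa, "precision": precision, "recall": recall,
            "F1": f1, "IOU": iou}
```

For example, `pixel_metrics(80, 10, 10, 900)` gives OA 0.98, IOU 0.80, and F1 about 0.889, which also illustrates why IOU is always the strictest of the three criteria.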

Visual Interpretation
To evaluate semantic image segmentation accuracy at a more detailed level, visual interpretation was conducted for the three segmented Sentinel-1 images. Examples of the evaluation of water body extraction results for the C-Chindwin river basin ('band combination 1358': VV, Slope, PC, and TRI) are presented in Figure 8. The enlarged images (a)-(d) in Figure 8 show: (a) Sentinel-1 images, (b) label data, (c) the segmentation result of the VV band, and (d) the segmentation result of 'band combination 1358'. As shown in the red dotted boxes in the output images, the segmentation result of 'band combination 1358' showed a significant reduction in mountain shadows, one of the main sources of misclassification, compared to the segmentation result of the VV band. As segmentation accuracy improved and terrain effects in the Sentinel-1 images and output images were reduced, the results of the qualitative accuracy assessment through visual inspection are consistent with those of the quantitative accuracy assessment. Similar improvement was observed in other output images produced from different band combinations whose gains in segmentation accuracy had been established through quantitative assessment. Training time tended to decrease as the number of bands increased, whereas inference time gradually increased with the number of bands. We attribute the decreasing training time to the 'early stopping function' adopted for training, and the increasing inference time to the size of the input data for inference, which is proportional to the number of bands. Although the inference time for water body extraction increased when geospatial layers were added to the Sentinel-1 VV band as ancillary data, it remains acceptable for operational flood monitoring.
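The 'early stopping function' invoked above can be illustrated with a generic patience-based loop; the patience value and the loss curve below are hypothetical, not the study's actual training configuration.

```python
def train_with_early_stopping(val_losses, patience=3):
    """Stop when validation loss has not improved for `patience` epochs.

    `val_losses` is an iterable of per-epoch validation losses; returns
    the number of epochs actually run. If richer multi-band input makes
    the loss plateau sooner, training stops earlier, shortening wall time.
    """
    best = float("inf")
    stale = 0
    epochs_run = 0
    for loss in val_losses:
        epochs_run += 1
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run

# Hypothetical loss curve that plateaus after epoch 4:
epochs = train_with_early_stopping([0.9, 0.5, 0.4, 0.35, 0.36, 0.37, 0.38, 0.30])
```

With `patience=3`, the loop halts after three consecutive non-improving epochs, so later epochs (including the late dip to 0.30) are never run, which is the trade-off early stopping accepts in exchange for shorter training time.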

Summary and General Discussion
Considering previous research, it is clear that the disciplines most directly concerned with flood monitoring using satellite data, namely disaster management and remote sensing science, have not fully examined how flooded areas can be extracted with the state-of-the-art technique for image classification, i.e., deep learning. To test our assumption that deep learning-based water body extraction can be improved by using geospatial layers as additional input layers, we advanced an existing deep learning model by customising and optimising its network and processing procedures for more accurate and faster image segmentation.
Through the experiment, a novel water body extraction model based on a deep neural network that exploits Sentinel-1 data and flood-related geospatial datasets was presented for flood monitoring across the Southeast Asia region. For the model, the U-Net was customised and optimised to utilise Sentinel-1 data and other flood-related geospatial data, including digital elevation model (DEM), Slope, Aspect, Profile Curvature (PC), Topographic Wetness Index (TWI), Terrain Ruggedness Index (TRI), and Buffer, in GeoTiff format for the Southeast Asia region. The main features of our deep neural network for water body extraction from Sentinel-1 images are: (1) the model can take geo-located and multi-layered SAR images and other ancillary data in GeoTiff format as input data, (2) the optimised deep network presented for this research does not lose any spatial resolution or location information of the multi-modal input data, and (3) the size of the input data for inference is significantly larger than that of the original U-Net, which reduces processing time and achieves the seamless merging of segmented image patches for inference.
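Feature (3), larger inference tiles merged seamlessly, can be sketched as a simple tiling-offset generator. The tile size and overlap below are illustrative assumptions (training patches in this study were 320 × 320 × 8), and the function assumes the image is at least one tile in each dimension:

```python
def tile_offsets(height, width, tile=320, overlap=32):
    """Yield (row, col) upper-left offsets covering an image with
    overlapping tiles; overlapping margins can be cropped on merge so
    the stitched segmentation map shows no visible seams.
    Assumes height >= tile and width >= tile."""
    step = tile - overlap
    rows = list(range(0, height - tile + 1, step))
    cols = list(range(0, width - tile + 1, step))
    # ensure the last tile touches the image border
    if rows[-1] != height - tile:
        rows.append(height - tile)
    if cols[-1] != width - tile:
        cols.append(width - tile)
    return [(r, c) for r in rows for c in cols]
```

For a 640 × 640 image this yields a 3 × 3 grid of overlapping tiles, every tile lying fully inside the image; larger inference tiles reduce the number of patches per scene and hence total inference time.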
To test and validate the water body extraction model, it was applied to three areas in Vietnam, Myanmar, and Bangladesh, and model performance and segmentation accuracy for all 128 cases determined by the combinations of stacked layers were evaluated according to the types of combined input layers. The total number of Sentinel-1 images segmented for this research is therefore 384. Of the 128 cases tested, 31 showed improvement in Overall Accuracy (OA), and 19 showed improvement in both averaged IOU and F1 score for the three images classified for water body extraction. Most cases consisting of six, seven, or eight layers showed lower OA, IOU, and F1 scores than the Sentinel-1 VV band image. The averaged OA, IOU, and F1 scores of the Sentinel-1 VV band are 95.77, 80.35, and 88.85, respectively, whereas those of 'band combination 1358' (VV, Slope, PC, and TRI) are 96.73, 85.42, and 92.08, showing improvements in all of the accuracy assessment criteria. The degrees of improvement in the three criteria are 0.96, 5.07, and 3.23, respectively. The improved segmentation accuracy of 'band combination VV, Slope, PC, and TRI' showed a higher OA and F1 score than other Sentinel-1-based flood monitoring models [33] or deep learning-based flood monitoring models [17,30,49]. In addition, the averaged processing time, i.e., training and inference time for a Sentinel-1 image, of the 'band combination VV, Slope, PC, and TRI' is much shorter than that of [17,33].
Such improvement was clearer in the water body extraction results for the C-Chindwin river basin, which contains mountainous areas. For that image, quantitative evaluation of 'band combination 1358' (VV, Slope, PC, and TRI) showed an improvement in F1 score of 7.68 percent compared to the segmentation output of the Sentinel-1 VV band, and this was also demonstrated through visual interpretation. As segmentation accuracy improved and terrain effects in the Sentinel-1 images and output images were reduced, the results of the qualitative accuracy assessment through visual inspection are consistent with those of the quantitative accuracy assessment. To the best of our knowledge, this is the first study that demonstrates the synergistic use of geospatial data in deep learning-based water body extraction over wide areas.

Novelty, Limitations, and Future Work
The main purpose of this research is to present a reliable deep learning-based flood monitoring model with better predictive ability by testing the effectiveness of geospatial data. For this research, the U-Net architecture was customised and optimised to utilise Sentinel-1 data and seven different types of geospatial layers. It was demonstrated that the accuracy of deep learning-based water body extraction can be improved by using geospatial data, and based on the experiment, a new water body extraction model is presented for flood monitoring across the Southeast Asia region. While previous studies focused on producing more training data or advancing network architectures to improve image classification accuracy, we focused instead on utilising available flood data and flood-related geospatial data, and demonstrated our assumption that deep learning-based water body extraction can be improved by using geospatial layers as additional input layers.
Although it was demonstrated that deep learning-based water body extraction can be improved by exploiting geospatial layers, this does not mean that classification performance is always improved by using geospatial layers, nor that the result of this research is applicable to other existing deep neural networks without testing its applicability and transferability. As per the research aim of this study, this research is confined to evaluating satellite data and available geospatial layers. To derive more reliable water body extraction models for flood monitoring, more geospatial layers and non-geospatial data need to be tested, and the possibility of reducing misclassification from other factors, such as roads and airports, needs to be verified to achieve better classification accuracy.

Conclusions
Floods occur unexpectedly and cause devastating damage over broad areas. Yet the timely monitoring and estimation of inundated areas using satellite data has been limited by satellite data acquisition and classification accuracy. Although deep learning is a promising method for satellite image classification, the synergistic use of geospatial data for water body extraction from Sentinel-1 data using deep learning, and the applicability of existing deep learning models, have not been thoroughly tested for operational flood monitoring. To fill this knowledge gap, a novel water body extraction model was presented based on a deep neural network that exploits Sentinel-1 data and flood-related geospatial datasets, including digital elevation model (DEM), Slope, Aspect, Profile Curvature (PC), Topographic Wetness Index (TWI), Terrain Ruggedness Index (TRI), and Buffer, for the Southeast Asia region. For the model, the U-Net was customised and optimised to utilise Sentinel-1 data and other flood-related geospatial data in GeoTiff format for operational flood monitoring in the Southeast Asia region. Testing and validation of the water body extraction model was applied to three Sentinel-1 images for Vietnam, Myanmar, and Bangladesh. Model performance and segmentation accuracy for all 128 cases determined by the combinations of stacked layers were evaluated according to the types of combined input layers.
Through this research, it was demonstrated that the accuracy of deep learning-based water body extraction can be improved by up to 7.68 percent by using geospatial data, and based on the experiment, a new water body extraction model, further verified through visual inspection and the evaluation of model performance, including training and inference time, is presented for operational flood monitoring across the Southeast Asia region. As per the research aim of this study, this research is confined to evaluating satellite data and available geospatial layers. To derive more reliable water body extraction models for operational flood monitoring, more geospatial layers and non-geospatial data need to be tested, and the possibility of reducing misclassification from other factors, such as roads and airports, needs to be verified to achieve better classification accuracy.

2.1. Pre-Processing and Modification of Input Data
2.1.1. Sentinel-1 and Ground Truth Data for the Southeast Asia Region

Figure 1. Sentinel-1 images used in this research and an example of UNOSAT flood data ((a) 50 scenes of Sentinel-1 images acquired between 2015 and 2018 for visual inspection, (b) 30 scenes of Sentinel-1 SAR images for training deep learning networks and evaluating model performance, (c) 3 scenes (A-C) for inference and validation, (d) flood data in shapefile format (yellow) overlaid on a Sentinel-1 image).

Figure 3. Deep learning-based water body extraction model. The model consists of (a) producing input data and pre-processing, (b) stacking and matching input data, (c) image segmentation with a deep neural network, and (d) accuracy assessment. The architecture of the deep learning model in the figure is built on U-Net [44].

Figure 4. Producing and matching input data for model training and inference ((a) stacking geospatial layers for matching, (b) extracting the stacked geospatial layers for Sentinel-1 images (around 30,000 × 20,000), and (c) producing 320 × 320 × 8 images for training and inference).

Figure 5. Examples of cropping, selecting, and pairing training data for testing the flood monitoring model developed for this research (452,000 pairs of label data (top) and Sentinel-1 image (bottom) patches were initially produced, and part of them were excluded based on the principle of water rate and the existence of borders in image patches).

Figure 6. Averaged Overall Accuracy (OA), IOU, and F1 score of all of the 128 cases being tested.

Figure 7. Segmented images of the three cases described in Figure 1c (A-Padma river basin in Bangladesh (first column), B-Lower Mekong river basin in Vietnam (second column), and C-Chindwin river basin in Myanmar (third column); the first and second rows show the Sentinel-1 image and label data for the sites, and the third to last rows show the classification results of VV, 'band combination 1358' (VV, Slope, PC, and TRI), and 'band combination 1357' (VV, Slope, PC, and Buffer) images of the corresponding site).

Figure 8. Segmentation results of the C-Chindwin river basin (see the red dotted boxes in the output images; (a) enlarged Sentinel-1 images, (b) label data, (c) segmentation result of the Sentinel-1 VV band, (d) segmentation result of 'band combination 1358' (VV, Slope, PC, and TRI)).

For operational flood monitoring through deep learning-based water body extraction, the training and inference times for the selected 19 cases are presented in Table 7. The training and inference times of the VV band were 1404.95 and 302.20 s, whereas those of 'band combination 1358' (VV, Slope, PC, and TRI) were 590.25 and 847.41 s, respectively. The averaged training time by number of bands was: 3 bands, 1008.44 s; 4 bands, 1020.91 s; and 5 bands, 1150.58 s, whereas the averaged inference time by number of bands was: 3 bands, 627.38 s; 4 bands, 856.19 s; and 5 bands, 1009.91 s.

Table 1. Examples of Sentinel-1 images for training and inference of deep learning models.

Table 2. Matching spatial resolution and normalisation of Sentinel-1 VV and geospatial layers.

Table 3. Hyper-parameters for training the deep learning models and inference.

Table 4. Criteria and equations for pixel-wise evaluation and accuracy assessment for output images.