Near-Real-Time Flood Mapping Using Off-the-Shelf Models with SAR Imagery and Deep Learning

Timely detection of flooding is paramount for saving lives and for evaluating levels of damage. Floods generally occur under specific weather conditions, such as excessive precipitation, which makes the presence of clouds very likely. For this reason, radar-based sensors are the most suitable for near-real-time flood mapping. The public dataset Sen1Floods11, recently released by Cloud to Street, is one example of ongoing beneficial initiatives to employ deep learning for flood detection with synthetic aperture radar (SAR). The present study used this dataset to improve flood detection using well-known segmentation architectures, such as SegNet and UNet, as networks. In addition, this study provides a deeper understanding of which combination of polarized bands is more suitable for distinguishing permanent water, as well as flooded areas, in a SAR image. The overall performance of the models with various kinds of labels and band combinations for detecting all surface water areas was also assessed. Finally, the trained models were tested on a completely different location, Kerala, India, during the 2018 flood, to verify their performance in a real-world flood event outside of the test set in the dataset. The results show that the trained models can be used off the shelf to achieve an intersection over union (IoU) as high as 0.88 when compared against optical images. The omission and commission errors were less than 6%. Most importantly, the processing time for a whole satellite image was less than 1 min. This will help significantly in providing analysis and near-real-time flood mapping services to first-responder organizations during flooding disasters.


Introduction
The importance of surface water mapping can be understood by studying UN Sustainable Development Goals (SDGs), in which as many as four goals directly mention surface water monitoring, including food security (target 2.4), water-related ecosystem management (targets 6.5 and 6.6), and the effect on land (target 15.3). However, the most relevant target in this study is target 11.5 under goal 11 (sustainable cities and communities), which states "By 2030, significantly reduce the number of deaths and the number of people affected and substantially decrease the economic losses relative to the gross domestic product caused by disasters, including water-related disasters, with a focus on protecting the poor and people in vulnerable situations" [1]. In this context, near-real-time (NRT) flood mapping becomes very necessary.
Because flooding is a large-scale phenomenon, and given the improved spatial, temporal, and radiometric resolution of satellite images, remote sensing is the obvious choice for flood mapping [2]. There are many works related to the extraction of surface water information, including floods, varying from different sensor types to different methods [3]; Huang et al., 2018 [3], noted that the number of works related to "surface water" mapping has grown steadily. The following sections describe the dataset and networks used, as well as validation data generation on the test site and the performance measures used in the study. Finally, the results of the different models' performance are discussed using detailed illustrations and explanations. The trained models, validation data used in this study, and results are uploaded at https://sandbox.zenodo.org/record/764863. The source code is available from the authors upon reasonable request.

Dataset Description and the Test Area
This study used the Sen1Floods11 dataset that was released during the 2020 Computer Vision and Pattern Recognition Workshop [23] and generated by Cloud to Street, a public benefit corporation. Details of the dataset are given below.
The dataset is divided into two parts, one containing data related to flood events and another for permanent bodies of surface water. The permanent water data include images from the Sentinel-1 satellite constellation and corresponding labels from the European Commission Joint Research Centre (JRC) global surface water dataset. We mainly used the flood events dataset in this study, which has two types of labels: weakly labeled and hand labeled. Weakly labeled here means that the labels have not been checked for quality, as they were generated through semi-automated algorithms that use certain thresholds to separate water and non-water areas. The weakly labeled data have two kinds of labels, generated from Sentinel-1 and Sentinel-2 images, respectively. These labels are binarized images containing ones (water pixels) and zeros (non-water pixels). Sentinel-1 weak labels were prepared using the Otsu thresholding method over the focal mean-smoothed VH band. For creating the weak labels from Sentinel-2 images, expert-derived thresholds of 0.2 and 0.3 were applied over the normalized difference vegetation index (NDVI) and the modified normalized difference water index (MNDWI), respectively. These weakly labeled data have not been quality controlled, and over- or under-segmentation is possible. The hand-labeled data were created using information from overlapping tiles of both Sentinel-1 and Sentinel-2. The manual classification was performed using the Sentinel-1 VH band and two false-color images of Sentinel-2 (RGB: B12, B8, B4 and B8, B11, B4) that highlight the water areas in the optical images. The resultant labels are more accurate and have three values in the output: 1 (water pixels), 0 (non-water pixels), and −1 (clouds or cloud shadows).
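As an illustration, the Sentinel-2 weak-labelling rule described above can be sketched as follows. The band-array names and the direction of each comparison (NDVI below 0.2, MNDWI above 0.3) are our assumptions, since the text gives only the threshold values:

```python
import numpy as np

def s2_weak_label(red, green, nir, swir):
    """Sketch of a Sentinel-2 weak water label.

    Assumption: a pixel is weakly labeled as water when vegetation
    is low (NDVI < 0.2) and the water index is high (MNDWI > 0.3).
    """
    ndvi = (nir - red) / (nir + red + 1e-9)     # vegetation index
    mndwi = (green - swir) / (green + swir + 1e-9)  # water index
    # Binarized label: 1 = water, 0 = non-water, as in the dataset.
    return ((ndvi < 0.2) & (mndwi > 0.3)).astype(np.uint8)
```

For example, a pixel with high green and low SWIR reflectance (open water) gets label 1, while a vegetated pixel with high NIR gets label 0.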
Overall, 4830 non-overlapping chips were available to us, belonging to flood events in 11 countries. Of these, 4385 chips are weakly labeled with corresponding S1Weak (Sentinel-1) and S2Weak (Sentinel-2) labels, while 446 chips are hand labeled and have corresponding quality-controlled labels. Each chip is 512 × 512 pixels. All chips have overlapping Sentinel-1 and Sentinel-2 images. Sentinel-1 chips were created using dual-polarized Sentinel-1 ground range detected (GRD) images. These images were downloaded from Google Earth Engine, where each image had been pre-processed using the Sentinel-1 Toolbox with the following steps: thermal noise removal, radiometric calibration, terrain correction using SRTM 30, and finally conversion of both bands' values into decibels via log scaling. In contrast, Sentinel-2 chips are from raw Sentinel-2 MSI Level-1C images containing all 13 bands (B1-B12). The 13 spectral bands represent top-of-atmosphere reflectance, scaled by 10,000.
The hand-labeled data were split into three parts with a ratio of 60:20:20 into training, validation, and test sets. In contrast, all the weakly labeled data were used for training purposes only. In this way, the test set and validation set remained the same throughout, while training data could be changed according to our requirements, and we could do cross-comparison for different kinds of training data.
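A minimal sketch of such a 60:20:20 split over a list of chip identifiers; the shuffling and seed are our assumptions (the dataset ships with a fixed split), shown only to make the ratio concrete:

```python
import random

def split_60_20_20(chips, seed=0):
    """Split a list of chip IDs 60:20:20 into train/val/test.

    Illustrative sketch: a fixed seed keeps the validation and
    test sets the same across experiments, as described in the text.
    """
    chips = list(chips)
    random.Random(seed).shuffle(chips)
    n = len(chips)
    n_train, n_val = int(n * 0.6), int(n * 0.2)
    return (chips[:n_train],                      # training set
            chips[n_train:n_train + n_val],       # validation set
            chips[n_train + n_val:])              # test set
```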

Test Area
The study area chosen for applying the off-the-shelf model is in the southern Indian state of Kerala, as shown in Figure 1. In 2018, an especially devastating flood occurred in Kerala; this flood took more than 400 lives and affected millions more. Figure 1 highlights the worst-affected districts in Kerala; western districts faced much more severe flooding than eastern districts because the western districts are topographically flat (coastal plains). As mentioned in Table 1, four Sentinel-1 images of the affected area from the same date, 21 August 2018, were selected for testing. Two images were acquired during the ascending flight direction and two during the descending flight direction. The closest Sentinel-2 image of the same area is available for 22 August 2018. However, most of the area in this image is covered by clouds. Therefore, only the area belonging mainly to the Alappuzha district was selected, because it has no or very few cloud-affected pixels in the Sentinel-2 image. This was done to validate the detection from the Sentinel-1 image. In general, a reference flood mask is generated from aerial images [20,24] or optical images such as WorldView [25] or Sentinel-2 [26,27]. The authors therefore adopted a Sentinel-2 image, which has previously been used successfully for flood mapping [28], both on its own and as a flood reference mask to validate the results [26,27]. To make a reference water mask from the Sentinel-2 image, the MNDWI, a false-color composite using bands B12, B8, and B4, and the true-color composite using bands B4, B3, and B2 were used, and visual inspection was performed to maintain the accuracy of the mask.

Networks and Hyperparameters
DCNNs are composed of cascades of layers executing three main operations: convolution, downsampling, and upsampling. The convolution layer applies a local kernel of a certain size, such as 3 × 3, over the image. This kernel traverses the height and breadth of the input image and generates the convolved output [29]. The kernel elements can be understood as the weights that are learned in the neural network using backpropagation. The max-pooling layer subsamples the input and outputs only the maximum value. In this way, the spatial dimension shrinks (downsampling); if the neighborhood size is chosen as 2 × 2, the output value is the maximum of the four values, producing an output a quarter of the input size. Upsampling can be considered the reverse of the max-pooling function, as after upsampling the spatial dimension of the output increases.
Activation functions used in DCNNs introduce non-linearity into the network, which helps it map complex functions. One example of an activation function is the ReLU (rectified linear unit), defined as f(x) = max(0, x); this means that ReLU passes positive values unchanged while converting all negative values to zero [30].
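The two operations above can be illustrated in a few lines of NumPy; the 4 × 4 example array is ours:

```python
import numpy as np

def relu(x):
    """ReLU: pass positive values unchanged, clamp negatives to zero."""
    return np.maximum(0, x)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2: keep the maximum of each
    non-overlapping 2x2 neighborhood, quartering the spatial size."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1., -2., 3., 0.],
              [4., 5., -6., 7.],
              [-1., 2., 3., 4.],
              [0., 1., 2., -3.]])
pooled = max_pool_2x2(relu(x))   # 2x2 output: max of each block after ReLU
```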

Networks Used
Variants of auto-encoders, namely, SegNet-like [31] and UNet-like [32] architectures, were selected for segmenting the water areas from Sentinel-1 chips. These networks, shown in Figure 2, were selected because they are simple compared with other existing segmentation networks such as HRNet [33] and DANet [34], and they have also shown great performance when the dataset is limited in size [32,35]. Both networks can be divided into two parts, the contraction phase (encoder path) and the expansion phase (decoder path). Each block in the encoder path contains two convolution layers with a kernel size of 3 × 3 and 'same' padding, along with batch normalization [30] and a rectified linear unit (ReLU) activation layer. This is followed by a max-pooling layer with a size of 2 × 2 and a stride of 2. In this way, the convolution layers increase the number of features in the channel space (depth) while the max-pooling layer contracts the dimensions of the spatial feature space. Between the two networks, the number of blocks in the encoder and decoder paths remained the same; the main difference was the method for increasing the spatial size (upsampling) in the decoder section. UNet uses up-convolution along with skip connections to reuse the features from previous layers, while SegNet's decoder upsamples using the pooling indices computed in the max-pooling step of the corresponding encoder blocks. Thus, in SegNet, only spatial information is transferred from the lower-level layers, while in UNet, the low-level feature space is also transferred to the high-level feature space and concatenated with it at the corresponding levels. This passing of low-level features to higher levels is possible due to skip connections, which bypass the intermediate layers [30]. The networks remained fixed across the training cases and the various band-combination inputs; only the shape of the input layer was modified, while all intermediate and output layers remained constant.
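The difference between the two decoder strategies can be sketched in NumPy: SegNet-style unpooling places each value back at the position recorded during max pooling, while a UNet-style merge concatenates encoder (skip) features onto the upsampled decoder features. This is a single-channel toy illustration, not the networks' actual implementation:

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pool that also records argmax positions (SegNet-style)."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    idx = blocks.argmax(axis=1)                  # position of max in each block
    pooled = blocks.max(axis=1).reshape(h // 2, w // 2)
    return pooled, idx

def max_unpool(pooled, idx):
    """SegNet-style unpooling: restore each value at its recorded
    position inside an otherwise-zero 2x2 block (spatial info only)."""
    h, w = pooled.shape
    blocks = np.zeros((h * w, 4))
    blocks[np.arange(h * w), idx] = pooled.ravel()
    return blocks.reshape(h, w, 2, 2).transpose(0, 2, 1, 3).reshape(h * 2, w * 2)

def unet_merge(upsampled, skip):
    """UNet-style merge: concatenate encoder (skip) features with the
    upsampled decoder features along the channel axis."""
    return np.concatenate([upsampled, skip], axis=-1)
```

Note how `max_unpool` carries only positions forward, whereas `unet_merge` carries the full low-level feature maps, which matches the distinction drawn in the text.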

Hyperparameters
For the entire study, the mini-batch size was 16, and training iterated over the whole dataset 200 times (epochs). The loss function was a custom weighted combination of Dice loss and binary cross-entropy (BCE) [29,36]. While the Dice score mainly measures the similarity of segmentation blobs, BCE measures pixel-wise error. Because the Dice score captures spatial information better than BCE, Dice loss was given a higher weight of 0.85 and BCE a lower weight of 0.15. The Adam optimizer [37] was used for training, with an initial learning rate of 0.01. The learning rate was decayed for faster convergence and to avoid overfitting [29]: if there was no improvement on the validation set (tolerance of 0.001) for 10 consecutive epochs, the learning rate was reduced by a factor of 0.8, down to a minimum of 0.0001. The training was performed on a single NVIDIA Titan V GPU. Model development and training used the TensorFlow platform along with the Keras library in Python.
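A NumPy sketch of such a weighted Dice + BCE loss (the 0.85/0.15 weights come from the text; the reduction and smoothing constants are our assumptions, and the actual training code would use TensorFlow tensors):

```python
import numpy as np

def dice_bce_loss(y_true, y_pred, w_dice=0.85, w_bce=0.15, eps=1e-7):
    """Weighted Dice + binary cross-entropy loss (illustrative sketch)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Dice loss: 1 - 2|A∩B| / (|A| + |B|) -- measures blob overlap.
    dice = 1 - (2 * (y_true * y_pred).sum() + eps) / (y_true.sum() + y_pred.sum() + eps)
    # BCE: mean pixel-wise log loss.
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return w_dice * dice + w_bce * bce
```

A perfect prediction drives both terms toward zero, while the Dice term penalizes poorly overlapping segmentation blobs even when per-pixel error is small.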

Training Strategy
In total, three training cases were selected: (1) training using Sentinel-1 weak labels, (2) training using Sentinel-2 weak labels, and (3) training using the more accurate hand labels. In each case, four SegNet-like and UNet-like networks were trained on different band combinations: both polarizations (VV, VH), only cross-polarization (VH), only co-polarization (VV), and the ratio as a third band, making the input VV, VH, and VH/VV. It should be noted that the VV and VH bands are already in log scale, so the values of each pixel ranged between −50 and 1 dB. These inputs were normalized using min-max values so that the resultant values were between 0 and 1 before being passed on for training. Additionally, for the calculation of the VH/VV ratio, we simply subtracted the log-scaled VV band from the VH band, since log(VH/VV) = log(VH) − log(VV). Later, transfer learning was also used to explore the option of making our model more adaptable and scalable. For this step, three cases were selected. In the first case, the whole network was retrained, with the pre-trained weights used as starting weights rather than the random weights typically used when training from scratch. In the other two cases, we retrained only the contraction phase (encoder) while freezing the expansion phase (decoder), and vice versa. Transfer learning has various benefits, such as the ability to include more training data in the future to further tune the network, faster convergence due to pre-trained weights [38], and the possibility of extending the trained model to be used with a new set of satellite images, such as from a different SAR satellite [39].
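The input preparation described above might look as follows. Whether min-max normalization was applied globally or per band is not stated in the text, so the global scaling here is an assumption:

```python
import numpy as np

def prepare_input(vv_db, vh_db):
    """Stack VV, VH, and their ratio band, then min-max scale to [0, 1].

    Because both bands are already log-scaled (decibels), the VH/VV
    ratio band is a simple subtraction: log(VH/VV) = log(VH) - log(VV).
    Global (not per-band) min-max scaling is an assumption.
    """
    ratio_db = vh_db - vv_db                      # log of a ratio = difference of logs
    stacked = np.stack([vv_db, vh_db, ratio_db], axis=-1)
    lo, hi = stacked.min(), stacked.max()
    return (stacked - lo) / (hi - lo + 1e-9)      # normalized to [0, 1]
```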

Testing on the Test Dataset
Three test cases were selected: all surface water detection, only permanent water detection (using corresponding JRC labels), and flooded water detection (difference between all water and permanent water). Because some of the image chips did not have any permanent water, they were removed from the test set of permanent water. In total, we had 90 test image chips for all water and flood water detection, and 54 chips for permanent water detection. All the trained networks, totaling 24 networks (SegNet and UNet), were tested over the given three test cases.

Testing as an Off-the-Shelf Model on the Whole Image during the 2018 Kerala Floods
To verify the generalizability of the trained model for use as an off-the-shelf model during an emergency, a completely different flood event, the 2018 Kerala floods, was selected. The first validation flood mask was prepared using the Sentinel-2 image. Although most of the Sentinel-2 image was covered by clouds, fortunately, the area most affected by the flooding had minimal cloud cover, so that area was selected, amounting to 794 km² (Figure 3). After obtaining the desired area, semi-automatic classification in QGIS was applied to the Sentinel-2 bands B2, B3, B4, B8, B11, and B12 along with the MNDWI. In this step, certain regions of interest for water pixels were manually chosen across the selected area, after which classification was performed. However, this classification still had numerous errors and cloud obstructions. Thus, after the classification was complete, manual inspection was performed to further improve the classified results. In the end, these accurate classified results were exported as a binary flood mask, and this mask served as the ground truth in the validation phase. The Sentinel-1 images were first pre-processed using the European Space Agency's snappy package in Python to sequentially perform thermal noise removal, radiometric calibration, speckle filtering, and terrain correction. Because the selected area lies where two satellite images from the same flight direction meet, both images were merged and gap-filled using QGIS; the same was done for the two images from the other flight direction. After this, two separate methods were employed for water area classification. The first is a thresholding method, where a threshold was selected based on a combination of minimum distance and the Otsu method, implemented using the scikit-image library.
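For illustration, Otsu's method itself can be written in a few lines of NumPy. The authors used the scikit-image implementation; this stand-alone version is only a sketch of the underlying algorithm (choose the threshold maximizing between-class variance of the histogram):

```python
import numpy as np

def otsu_threshold(img, nbins=256):
    """Minimal Otsu's method: pick the gray level that maximizes
    the between-class variance of the image histogram."""
    hist, edges = np.histogram(img.ravel(), bins=nbins)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                       # pixels in the dark class
    w1 = w0[-1] - w0                           # pixels in the bright class
    csum = np.cumsum(hist * centers)           # cumulative intensity sum
    mu0 = csum / np.maximum(w0, 1)             # dark-class mean
    mu1 = (csum[-1] - csum) / np.maximum(w1, 1)  # bright-class mean
    var_between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(var_between)]
```

On a SAR backscatter image, water appears dark, so the water mask would be the pixels below the returned threshold.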
For the other method, our best-performing trained model (after transfer learning), generated in the previous step, was used on the pre-processed image to obtain a flood map as binarized output. The whole image, with a size of 13,797 × 7352 pixels, was processed within 1 min and transformed into a binarized output. After this, the same area as that selected for Sentinel-2 was clipped from the output for evaluation purposes.

Accuracy Evaluation
Four indicators were adopted to measure the performance of the proposed approach against the ground truth, as well as for the comparison with the thresholding method: Equation (1), intersection over union (IoU); Equation (2), F1 score; Equation (3), omission error; and Equation (4), commission error. These are defined as:

IoU = TP / (TP + FP + FN)  (1)

F1 = 2TP / (2TP + FP + FN)  (2)

Omission error = FN / (TP + FN)  (3)

Commission error = FP / (TP + FP)  (4)

where TP, FP, and FN denote the true positive, false positive, and false negative pixels, respectively.
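These four measures can be computed directly from a pair of binary masks:

```python
import numpy as np

def flood_metrics(pred, truth):
    """IoU, F1, omission and commission errors from binary masks,
    following the standard TP/FP/FN definitions of Equations (1)-(4)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)    # water correctly detected
    fp = np.sum(pred & ~truth)   # false alarms
    fn = np.sum(~pred & truth)   # missed water
    return {
        "iou": tp / (tp + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "omission": fn / (tp + fn),      # fraction of water missed
        "commission": fp / (tp + fp),    # fraction of detections that are wrong
    }
```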

Results on the Test Dataset
Figures 4 and 5 show the different models' mean IoU (mIoU) over the whole test set for SegNet and UNet, respectively. The x-axis shows the training cases, while the y-axis represents the mIoU. The detailed quantitative results from each network along with the respective errors are presented in Tables 2 and 3. Columns in the table represent the three detection test cases, namely, permanent water, flooded water, and all surface water. For each test case, the three evaluation criteria, mIoU, omission error (Om.), and commission error (Comm.), as used in [23], are presented. In the rows of the tables, the three training cases, namely, training using Sentinel-1 weakly labeled, Sentinel-2 weakly labeled, and hand-labeled data and their corresponding results are given. Each training case further has four variations consisting of different combinations of SAR bands. Along with our results, the baseline results from [23] are also shown for each training case, as well as Otsu thresholding results for better comparison.
In both types of networks, a common pattern can be seen. For permanent water, the band combination of VV, VH, and the VH/VV ratio performed best in most cases, while for flooded water and all surface water, the input with both polarizations, VV and VH, gave the best results. Note that when only the co-polarized band (VV) was used, the networks trained on weak labels performed worst, especially for flooded water detection and all surface water detection, with a very high omission error. The cause can be understood from the properties of SAR backscattering: flooded vegetation or agricultural fields may show very high backscattering in the co-polarized band due to double bounce (caused by the small wavelength of C-band SAR) [40]. A more detailed explanation is provided in Section 4.
In contrast with the results in Bonafilia et al. [23], where the best results for permanent water came from Otsu thresholding, our results clearly show that both SegNet and UNet convincingly surpass the Otsu thresholding benchmark, as well as the baseline results, in all training cases. However, the other results, for flooded water and all surface water, are in line with [23], as the best detection for both SegNet and UNet comes from the models trained using the Sentinel-2 weakly labeled dataset. Moreover, for flooded water, our models show as much as a 50% improvement over the baseline, and for all surface water our model shows an improvement of around 40% with the UNet model trained on the hand-labeled dataset. Overall, the UNet-like networks outperformed the SegNet-like networks in detecting flooded water and all surface water, which is the target of this study. One reason may be the use of skip connections, which propagate shallow-layer features to the deeper layers, helping to create a better feature set for pixel-level classification. For this reason, subsequent processing was done using UNet only. This also suggests that features from the encoder layers played a more important role in processing the SAR images, which was further confirmed when transfer learning was used: encoder retraining gives better results than decoder retraining.
The weak labelling technique has the advantage of creating a larger set of training samples in an automated way, in less time and with less manpower than hand labeling, and a larger number of training samples helps the model generalize. However, hand-labeled data are consistent and include cases that cannot be captured by weak labelling techniques. Therefore, transfer learning was employed to take advantage of both: more samples for generalization and accurate labels for fine-tuning. As our focus in this study is flood mapping, the model trained using Sentinel-2 weak labels with both polarization bands (VV and VH) was selected for transfer learning, because it performed best among all band combinations for "flooded water" and "all surface water" detection. Transfer learning was then applied using the hand-labeled data, retraining for the three cases, namely, retraining the whole model, retraining only the expansion phase, and retraining only the contraction phase, all starting from the pre-trained weights. The results are presented in Table 4. Overall, the model with the encoder part retrained showed the best result, and it was used for near-real-time flood area detection at the chosen test site. To estimate the overall performance of the models for all test cases when trained on the hand-labeled dataset, a k-fold cross-validation procedure was carried out; its results are included in Appendix A.
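The three retraining cases reduce to choosing which parameter groups receive gradient updates. A framework-agnostic toy sketch (the `"encoder/..."`/`"decoder/..."` parameter naming is our assumption; in Keras the same effect is achieved by setting `layer.trainable = False` on the frozen phase before compiling):

```python
def transfer_step(params, grads, lr, retrain):
    """One gradient step with selective retraining (sketch).

    Only parameters whose group prefix ('encoder' or 'decoder') is in
    `retrain` are updated; the frozen phase keeps its pre-trained values.
    """
    return {
        name: (w - lr * grads[name] if name.split("/")[0] in retrain else w)
        for name, w in params.items()
    }
```

For example, `retrain={"encoder"}` corresponds to retraining only the contraction phase while the decoder stays frozen, and `retrain={"encoder", "decoder"}` to retraining the whole network from the pre-trained weights.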

Results on Test Site
The model resulting from the transfer learning performed notably well, and it was used on the test site in both ascending and descending flight directions. As shown in Table 5, it gave better results than the thresholding method. Moreover, the omission error was reduced significantly, from around 16% to 6%, which is a very important criterion in emergency mapping, where the omission error should be as low as possible. This means that false negatives should be few, even if some false positives creep in. False negatives are a problem because leaving a flood-affected area off the map may lead to bad decision making, such as failing to evacuate, or to people travelling into flooded regions. Figure 6 shows the merged SAR images of the ascending flight direction and the corresponding combined result of surface water detection by deep learning (our method) and Otsu thresholding. In the detection result, white and black pixels indicate that both methods classified the pixel the same way, as water or non-water, respectively; red and cyan pixels indicate disagreement. Cyan pixels were classified as water by our method but as non-water by the thresholding method, and red pixels the opposite. In general, thresholding suffers from noise in the output, visible in the combined results as salt-and-pepper noise, as well as in the yellow and green insets. Owing to this kind of noise, a post-processing step, such as morphological erosion-dilation or application of a minimal mapping unit [16], is required after thresholding. The yellow rectangle displays a partially flooded agricultural area that was detected successfully by the deep learning model (in cyan). In addition, the area shown by the green rectangle, which contains a few oxbow lakes on its far-right side, was successfully segmented by our model.
In contrast, the blue rectangle shows the area around Kochi Port, which is one of the largest ports in India and docks multiple large vessels. This area produced some of the brightest pixels, and our method was not able to detect water there, while the thresholding method achieved better results (red pixels). One reason the water was not detected by our method is that deep learning models learn contextual information through spatial feature mapping, and water pixels surrounded by brighter pixels (in this case from ships) are a rare phenomenon. One way to detect such rare cases is to include a few similar patterns in the training set or to use other ancillary data.
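The color-coded agreement map described for Figure 6 can be generated as follows; the colors follow the text, while the `dl_mask`/`ot_mask` names are ours:

```python
import numpy as np

def comparison_map(dl_mask, ot_mask):
    """Agreement map between deep learning (DL) and Otsu thresholding (OT):
    white = both water, black = both non-water,
    cyan = water only in DL, red = water only in OT."""
    h, w = dl_mask.shape
    rgb = np.zeros((h, w, 3), dtype=np.uint8)        # black by default
    rgb[dl_mask & ot_mask] = (255, 255, 255)         # white: agreement on water
    rgb[dl_mask & ~ot_mask] = (0, 255, 255)          # cyan: DL-only water
    rgb[~dl_mask & ot_mask] = (255, 0, 0)            # red: OT-only water
    return rgb
```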

Discussion
The results presented in Section 3 allow us to make the following observations. (1) When the labels are weak, models trained on the co-polarized VV band performed poorly in comparison to models trained on the cross-polarized VH band. One reason may be the high sensitivity of co-polarization to rough water surfaces, for example, due to wind, as described by Manjushree et al. [41] and Clement et al. [26]. However, for hand-labeled data, VV performs better than VH, especially for flooded areas. Figure 7 shows the results from models trained on different band combinations. Because the training set here was hand-labeled, VV performed better than VH in most cases, except in rows 6 and 7. One interesting outcome was that the three bands combined (VV, VH, and their ratio) gave the best results, except for the first row in Figure 7. This combination provided a marked improvement in some of the difficult test cases, such as rows 5-7. This is particularly interesting because the third band provides no new information; it is just the ratio of the two input bands already present. (2) Models trained on Sentinel-2 weakly labeled data gave better results than those trained on Sentinel-1 weakly labeled data, which is consistent with the results of Bonafilia et al. [23]. Moreover, the models trained on hand-labeled data approximately match the accuracy of the models trained with Sentinel-2 data and sometimes even beat them despite the limited number of samples, which goes against the conclusion of Bonafilia et al. [23] that hand-labeled data are not necessary for training fully convolutional neural networks to detect flooding. We have demonstrated that models trained with hand-labeled data perform better throughout, as shown in Tables 2 and 3. Figure 8 shows a few examples of the improvement achieved by hand-labeled data. However, models trained with hand-labeled data sometimes over-detect, as can be seen in the red-circled areas in the first and last rows of the figure.
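The three-band input discussed in observation (1) is simply the two polarizations plus their ratio stacked as channels. A minimal sketch follows; the small constant arrays are placeholders, not real backscatter values.

```python
import numpy as np

# Placeholder VV and VH backscatter tiles in linear power units.
vv = np.full((4, 4), 0.20, dtype=np.float32)
vh = np.full((4, 4), 0.05, dtype=np.float32)

eps = 1e-6                             # guard against division by zero
ratio = vv / (vh + eps)                # derived band: no new information
x = np.stack([vv, vh, ratio], axis=0)  # (channels, H, W) input for the network
```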
(3) The successful implementation of transfer learning proves two things. First, there is no substitute for more accurate labels (hand-labeled data), as the improved results show. Second, automatically generating many training samples is a good approach: a model trained on them generalizes better, because more samples cover more diverse cases and varieties of land cover. Further, we can use transfer learning to tune the model for a given test set. Another interesting result is that, for finding surface water in SAR images, general features play a larger role than specific features. As explained by Yosinski et al. [42], layers close to the input, the encoder blocks in our case, are responsible for general feature extraction, while deeper layers extract specific features.
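This observation can be tested by freezing one phase of the network and retraining the other. Below is a minimal PyTorch sketch of the idea, where `TinyUNet` is a hypothetical stand-in rather than the architecture used in this study.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy encoder/decoder pair standing in for a UNet-style model."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(nn.Conv2d(8, 1, 3, padding=1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyUNet()

# Freeze the expansion (decoder) phase; only the contraction (encoder)
# phase will be updated during fine-tuning.
for p in model.decoder.parameters():
    p.requires_grad = False

# Pass only the still-trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.Adam(trainable, lr=1e-4)
```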
In our experiments, freezing the expansion phase and retraining the contraction phase gave the most favorable result. This could be explored further with different architectures; if the same behavior persists, we could build an ensemble of many shallow networks to detect water areas in SAR images without wasting too many resources. The enhancement in water area detection achieved by transfer learning is presented in Figure 9; some examples, such as rows 1, 2, and 5, show significant improvement. (4) Looking only at the mIoU on the test dataset, its value of less than 0.5 does not paint a good picture of the surface water detection. However, examining test-set true labels alongside the detected masks, as in Figure 8, shows that the detection is quite accurate, especially by the model trained on hand-labeled data. Similar accuracy is seen in Figure 9, which shows the results of the transfer learning models. Some of the reasons for the low mIoU can be understood from Figure 10. In rows 1 and 2 of Figure 10, a very narrow stream has been labeled, but this stream is not visible in the SAR image, either because of mountainous terrain (row 1) or because of trees growing along it (row 2), making it difficult to identify any significant water pixels. We also need to account for SAR geometric effects caused by the side-looking imaging principle, which can hide water bodies such as rivers or small lakes behind shadow or under layover in mountainous regions [43]. Another issue is very small water bodies containing very few pixels scattered over the whole image (row 3). The spatial resolution of Sentinel-1 IW-GRD images is 5 m × 22 m in range and azimuth, respectively, so smaller water bodies cannot be detected [43].
In this case, even though very few pixels were misdetected, the IoU will be near zero, pulling down the mIoU of the whole test dataset. A few incorrect labels are also present in the test dataset; examples are shown in rows 5 and 6, where the red ellipses mark the locations of incorrect labels. In these situations, even though our model performs quite well, the IoU becomes very low or in some cases goes to zero, such as in the last row: according to the given label there are no water bodies, so the intersection is zero while the union consists of the detected water pixels, resulting in an IoU of zero. Moreover, there are many scenarios where, due to the special properties of SAR, the detection is not accurate, such as row 4 in Figure 10. This area was a flooded field with sparse vegetation, as can be seen in the true-color image in the last row of Figure 11. This creates a double bounce between the specular water surface and the vegetation in the co-polarized band (VV), and this anomaly is the reason the model is not able to identify the area as a flooded field. A similar example is shown in the first row of Figure 11, where sand deposits in the river have high backscatter in the VV band. One possible reason is the presence of moisture in the sand, which increases the dielectric constant and therefore the reflectivity. In addition, the VV band is in general more susceptible to surface roughness, so higher reflectivity combined with roughness may explain the high backscatter [40]. Such special cases could be detected by the model if enough training samples with similar properties were included.
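The IoU collapse described above is easy to reproduce: when a label contains no water, any detected water pixel makes the intersection zero while the union is non-empty, so the score for that image is exactly zero. A minimal sketch:

```python
import numpy as np

def iou(pred, label):
    """Per-image intersection over union for boolean water masks."""
    inter = np.logical_and(pred, label).sum()
    union = np.logical_or(pred, label).sum()
    return inter / union if union > 0 else 1.0  # both empty: perfect match

label = np.zeros((8, 8), dtype=bool)  # (incorrect) label containing no water
pred = np.zeros((8, 8), dtype=bool)
pred[2:4, 2:4] = True                 # model detects a small water body

print(iou(pred, label))  # → 0.0, although only 4 of 64 pixels disagree
```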
Some recommendations for future flood-mapping datasets are:
• Further classification of flooded areas by flood type, such as open flood, flooded vegetation, and urban flood.
• Ensuring that the test set is error-free and that enough samples are provided for a variety of flooded area types.
• Removing training sets that contain fewer than a certain number of water pixels, as our main target is to learn to identify water pixels. In their absence, models do not learn anything significant, no matter how many samples are processed.
Conclusions
In this paper, we explored different SAR band combinations and their utility for surface water detection and flood mapping. We found that using both polarizations is necessary for improved detection of flooded areas, and that adding a third band as the ratio of the two polarizations can add information in certain scenarios. We also showed that hand labelling cannot be avoided completely, but it can be combined with weak labels to develop a more accurate model. In this way we can take advantage of both: more samples from weak labelling for better generalization, and accurate samples from hand labelling for fine-tuning during transfer learning. In addition, transfer learning showed that the same models can be further improved as more training data become available, so existing datasets can be used for NRT flood mapping. As this technique uses only a single image, i.e., only the during-flood image, it is much easier to apply a generalized model to any affected area without the constraint of searching archived data for appropriate reference images. We have thus presented a complete pipeline to create an off-the-shelf model for NRT flood mapping using Sentinel-1 and demonstrated a notable improvement over thresholding techniques. We have shown that a whole satellite image can be processed in less than 1 min with a very low omission error. Our models can therefore serve as a prompt emergency response and information source for first-responder organizations. A similar methodology could also be explored with other satellites in the future.
Further improvements to the models can be made with access to better datasets in the future, such as more specific classes for floods (open floods, flooded vegetation, and urban floods) rather than only one general class. Moreover, some easily accessible ancillary data, such as height above the nearest drainage (HAND), can also be added for more refined detection.

Appendix A
To investigate the generalization capability of the model and to ensure that it is not overfitting the given data, K-fold cross-validation was performed using the hand-labeled data. The dataset was first divided into five equal parts, or folds; then, for each band combination, a model was trained on four parts while the remaining part was held out for testing. In this way, five sets of models were trained, each time leaving out a different part as the test set, so that the whole dataset was covered. The results of the models for permanent water, flooded water, and all surface water are shown in Figures A1-A3. The average over all five models for each band combination is given in Figure A1, along with the standard deviation. The results suggest that our models are consistent across folds, with standard deviations ranging between 2% and 4%.
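The five-fold scheme described above can be sketched with scikit-learn's `KFold`; the array of chip indices below is a placeholder for the hand-labeled dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

chips = np.arange(50)  # placeholder for 50 hand-labeled image chips
kf = KFold(n_splits=5, shuffle=True, random_state=42)

folds = []
for train_idx, test_idx in kf.split(chips):
    # Train one model per band combination on chips[train_idx],
    # then evaluate it on the held-out chips[test_idx].
    folds.append((len(train_idx), len(test_idx)))

print(folds)  # five (40, 10) splits; every chip is held out exactly once
```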