Enhancement of Detecting Permanent Water and Temporary Water in Flood Disasters by Fusing Sentinel-1 and Sentinel-2 Imagery Using Deep Learning Algorithms: Demonstration of Sen1Floods11 Benchmark Datasets

Abstract: Efficiently identifying permanent water and temporary water in flood disasters has mainly relied on change detection from multi-temporal remote sensing imagery, and estimating the water type in flood events from post-flood imagery alone remains challenging. Research progress in recent years has demonstrated the excellent potential of multi-source data fusion and deep learning algorithms for improving flood detection, but this field has been studied only preliminarily due to the lack of large-scale labelled remote sensing images of flood events. Here, we present a flood inundation mapping approach driven by new deep learning algorithms and multi-source data fusion, leveraging the large-scale, publicly available Sen1Floods11 dataset, which consists of roughly 4831 labelled Sentinel-1 SAR and Sentinel-2 optical images gathered from flood events worldwide in recent years. Specifically, we propose an automatic segmentation method for surface water, permanent water, and temporary water identification, in which all tasks share the same convolutional neural network architecture. We utilize focal loss to deal with the class (water/non-water) imbalance problem. Thorough ablation experiments and analysis confirmed the effectiveness of the proposed designs. In comparison experiments, the proposed method is superior to other classical models. Our model achieves a mean Intersection over Union (mIoU) of 52.99%, Intersection over Union (IoU) of 52.30%, and Overall Accuracy (OA) of 92.81% on the Sen1Floods11 test set. On the Sen1Floods11 Bolivia test set, our model also achieves a high mIoU (47.88%), IoU (76.74%), and OA (95.59%) and shows good generalization ability.


Introduction
Natural hazards such as floods, landslides, and typhoons pose severe threats to people's lives and property. Among them, floods are the most frequent, widespread, and deadly natural disasters, affecting more people globally each year than any other disaster [1][2][3]. Floods not only cause injuries, deaths, loss of livelihoods, infrastructure damage, and asset losses, but can also have direct or indirect effects on health [4]. In the past decade, there have been 2850 disasters triggered by natural hazards globally, of which 1298 were floods, accounting for approximately 46% [1]. In 2019, there were 127 floods worldwide, affecting 69 countries and causing 1586 deaths and more than 10 million displacements [2]. In the future, floods may become an even more frequent disaster posing a huge threat to human society due to sea-level rise, climate change, and urbanization [5,6].
The effective way to reduce flood losses is to enhance our capacity for flood risk mitigation and response. In recent years, timely and accurate flood detection products derived from satellite remote sensing imagery have become effective tools for responding to flood disasters, benefiting city and infrastructure planners, risk managers, disaster emergency response agencies, and property insurance companies worldwide [7][8][9][10][11]. However, efficiently identifying permanent water and temporary water in flood disasters remains a challenge.
In recent years, researchers have done substantial work on flood detection from satellite remote sensing images. The most commonly used method to distinguish between water and non-water areas is threshold splitting [12][13][14][15]. However, the optimal threshold is affected by the geographical area, time, and atmospheric conditions of image collection, which greatly limits the generalization ability of such methods. The European Space Agency (ESA) developed the series of Sentinel missions to provide freely available data, including Synthetic Aperture Radar (SAR) data from the Sentinel-1 sensor and optical data from the Sentinel-2 sensor. Given the respective advantages of optical and SAR imagery for flood information extraction [5,14,[16][17][18], combining SAR images with optical images for more accurate flood mapping is of great interest to researchers [16,[19][20][21][22]. Although data fusion improves the accuracy of flood extraction, distinguishing permanent water from temporary water in flood disasters remains very challenging. The identification of permanent water and of temporary water in flood disasters mainly relies on multi-temporal change detection methods [23][24][25][26][27][28]. This type of approach requires at least one pair of multi-temporal remote sensing scenes acquired before and after a flood event. Although multi-temporal change detection can detect temporary water in flood events well, it is greatly limited by its mandatory requirement for pre-disaster satellite imagery.
Deep learning methods, represented by convolutional neural networks, have been proven effective in the field of flood damage assessment, and related research has grown rapidly since 2017 [29]. However, most algorithms focus on affected buildings in flood events [22,30], with very few examples of flood water detection. The latest research focuses on applying deep learning algorithms to enhance flood water detection [31,32]. Early research focused on the extraction of surface water [33][34][35]. Furkan et al. proposed a deep-learning-based approach for surface water mapping from Landsat imagery; the results demonstrated that the deep learning method outperforms the traditional threshold and Multi-Layer Perceptron models. Surface water and floods have different characteristics in satellite imagery, which increases the difficulty of flood extraction. Maryam et al. developed a semantic segmentation method for extracting the flood boundary from UAV imagery. The semantic segmentation-based flood extraction method was further applied to identify the flood inundation caused by mounting destruction [36], and experimental results validated its efficiency and effectiveness. Muñoz et al. [37] combined multispectral Landsat imagery and dual-polarized synthetic aperture radar imagery to evaluate the performance of integrating a convolutional neural network and a data fusion framework for generating compound flood maps; the usefulness of this method was verified by comparison with other methods. These studies show that deep learning algorithms play an important role in enhancing flood classification. However, research in this field is still in its infancy due to the lack of high-quality, large-scale flood-annotated satellite datasets.
Recent developments in earth observation have contributed a series of open-source, large-scale disaster-related satellite imagery datasets, which have greatly spurred the advance of leveraging deep learning algorithms for disaster mapping from satellite imagery. For building damage classification, the xBD dataset has provided worldwide researchers with large-scale satellite imagery collected from multiple disaster types with four-category damage-level labels, and the research spawned by this public dataset has verified the great potential of deep learning in building damage recognition [38,39]. For flooded building damage assessment in hurricane events, FloodNet provides a high-resolution UAV image dataset playing a similar role [40]. The recent release of the large-scale open-source Sen1Floods11 dataset [5] is boosting research on utilizing deep learning algorithms for water type detection in flood disasters [41]. For water type detection in flood disaster events, Sen1Floods11 should take on a similar role. Unfortunately, so far, only one preliminary work has been conducted.
With the purpose of developing an efficient benchmark algorithm for distinguishing between permanent water and temporary water in flood disasters based on the Sen1Floods11 dataset, and thereby boosting research in this area, the contributions and originality of this research are as follows.

• Effectiveness: To the best of our knowledge, our proposed algorithm achieves the highest accuracy reported so far on the Sen1Floods11 dataset.
• Convenience: All of the Sentinel-1 and Sentinel-2 imagery utilized by the model comes from post-flood acquisitions, which greatly reduces the reliance on pre-flood satellite imagery.
• Refinement: We introduce a salient object detection algorithm to modify the convolutional neural network classifier; in addition, a multi-scale loss and data augmentation are adopted to improve the accuracy of the model.
• Robustness: The robustness of our proposed algorithm was verified on a new Bolivia flood dataset.

Sen1Floods11 Dataset
We utilize the Flood Event Data in the Sen1Floods11 dataset [5] to train, validate, and test deep learning flood algorithms. This dataset provides raw Sentinel-1 SAR images (IW mode, GRD product), Sentinel-2 MSI Level-1C images, and classified permanent water and flood water. There are 4831 non-overlapping 512 × 512 tiles from 11 flood events. The dataset supports flood mapping at the global scale, covering 120,406 square kilometers and spanning 14 biomes, 357 ecological regions, and six continents. Locations of the flood events are shown in Figure 1.
For each selected flood event, the time interval between the acquisition of the Sentinel-1 and Sentinel-2 imagery does not exceed two days. The Sentinel-1 imagery contains two bands, VV and VH, which are backscatter values; the Sentinel-2 imagery includes 13 bands, all of which are TOA reflectance values. The imagery is projected to the WGS-84 coordinate system. The ground resolution differs between bands; in order to fuse the images, all bands are resampled to a 10 m ground resolution. Each band is visualized in Figure 2.
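The resampling-and-normalization step can be sketched as follows. This is a minimal illustration with hypothetical helper names (`resample_to_10m`, `fuse_s1_s2`) and nearest-neighbour resampling, not the dataset's actual preprocessing, which would typically use a proper GIS resampler.

```python
import numpy as np

def resample_to_10m(band, target_shape=(512, 512)):
    """Nearest-neighbour resampling of a band to the common 10 m grid
    (hypothetical helper; a real pipeline would use a GIS resampler)."""
    rows = np.linspace(0, band.shape[0] - 1, target_shape[0]).round().astype(int)
    cols = np.linspace(0, band.shape[1] - 1, target_shape[1]).round().astype(int)
    return band[np.ix_(rows, cols)]

def fuse_s1_s2(s1_bands, s2_bands, target_shape=(512, 512)):
    """Stack Sentinel-1 (VV, VH) and Sentinel-2 bands into one model input,
    min-max normalising each band to [0, 1] after resampling."""
    stack = []
    for band in list(s1_bands) + list(s2_bands):
        band = resample_to_10m(np.asarray(band, dtype=np.float64), target_shape)
        lo, hi = band.min(), band.max()
        stack.append((band - lo) / (hi - lo) if hi > lo else np.zeros_like(band))
    return np.stack(stack, axis=0)  # shape: (channels, height, width)
```

Two Sentinel-1 bands plus thirteen Sentinel-2 bands would thus yield a 15-channel input tensor per tile.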
Due to the high cost of hand labeling, 4370 tiles are not hand-labeled and are exported with annotations automatically generated by the Sentinel-1 and Sentinel-2 flood classification algorithms, which can serve as weakly supervised training data. The remaining 446 tiles are manually annotated by trained remote sensing analysts for high-quality model training, validation, and testing. The weakly supervised data contain two types of surface water labels: one produced by histogram thresholding of the Sentinel-1 image, the other generated from the Sentinel-2 image by thresholding the Normalized Difference Vegetation Index (NDVI) and MNDWI. All cloud and cloud shadow pixels were masked and excluded from training and accuracy assessments. Hand labels include all-water labels and permanent water labels. For the all-water labels, analysts used Google Earth Engine to correct the automated labels, removing uncertain areas and adding missed water using the Sentinel-1 VH band, two false color composites from Sentinel-2, and the reference water classification from Sentinel-2. For the permanent water labels, with the help of the JRC (European Commission Joint Research Centre) surface water dataset, Bonafilia et al. [5] labeled pixels detected as water at both the beginning (1984) and end (2018) of that dataset as permanent water pixels. Pixels never observed as water during this period are treated as non-water pixels, and the remaining pixels are masked. Examples of the water labels are visualized in Figure 3.
Like most existing studies [42], the Sen1Floods11 dataset shows a highly imbalanced distribution between flooded and unflooded areas. As shown in Table 1, for all water, water pixels account for only 9.16% and non-water pixels account for 77.22%, about eight times the number of surface water pixels. The percentages of water and non-water pixels for permanent water are 3.06% and 96.94%, respectively; the number of non-water pixels is about 32 times that of water pixels. The dataset is split into three parts: training set, validation set, and test set. All 4370 automatically labeled images are used as the weakly supervised training set. The hand-labeled data are first randomly split into training, validation, and testing data in the proportion 6:2:2. In order to test the model's ability to predict unknown flood events, all hand-labeled data related to the Bolivia flood event are held out as a distinct test set.
The remaining hand-labeled data form the final training, validation, and test sets, respectively. Correspondingly, all data from Bolivia in the weakly supervised training set are also excluded and do not participate in model training. The overall composition of the dataset is shown in Table 2.

Figure 4 depicts a flowchart of our work. In this work, the benchmark Sen1Floods11 dataset, which contains 4831 samples of 512 × 512 pixels from both Sentinel-1 and Sentinel-2 imagery, was utilized to develop the algorithm. Spatial resolution resampling and pixel value normalization were adopted to fuse the Sentinel-1 and Sentinel-2 imagery. The model input is a stack of fused image bands with permanent water and temporary water annotations. The network used is BASNet, proposed by Qin et al. [43]; its architecture is shown in Figure 5. The network combines a densely supervised encoder-decoder network similar to U-Net and a new residual refinement module. The encoder-decoder produces a coarse probability prediction map from the image input, and the Residual Refinement Module is responsible for learning the residuals between the coarse probability prediction map and the ground truth. We apply the network to remote sensing datasets and adapt, train, and optimize it to better predict flood areas.

Encoder-Decoder Network
The encoder-decoder network can fuse abstract high-level information and detailed low-level information and is mainly responsible for water body segmentation. The encoder part contains an input convolution block and six convolution stages consisting of basic res-blocks. The input convolution block comprises a convolution layer with batch normalization [44] and a Rectified Linear Unit (ReLU) activation function [45]. The convolution kernel size is 3 × 3 with stride 1. This convolution block converts an input image with any number of channels into a feature map with 64 channels. The first four convolution stages directly use the four stages of ResNet34 [46]. Except for the first residual block, each residual block doubles the number of channels of the feature map. The last two convolution stages have the same structure, each consisting of three basic res-blocks with 512 filters and a 2 × 2 max pooling operation with stride 2 for downsampling.
Compared with traditional convolution, atrous convolution can obtain a larger receptive field without increasing the number of parameters and can capture long-range information. In addition, atrous convolution avoids the reduction of feature map resolution caused by repeated downsampling and allows a deeper model [47,48]. To capture global information, Qin et al. [43] designed a bridge stage to connect the encoder and the decoder. This stage comprises three atrous convolution blocks, each consisting of a convolution layer with 512 atrous 3 × 3 filters with dilation 2, batch normalization, and a ReLU activation function.
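A small calculation illustrates the benefit of dilation in the bridge stage: three stacked 3 × 3 convolutions with dilation 2 nearly double the receptive field of three standard 3 × 3 convolutions at the same parameter count. The sketch below uses the standard receptive-field recurrence, where an atrous kernel of size k and dilation d behaves like an effective kernel of size d(k − 1) + 1.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers given as (kernel, stride,
    dilation) tuples; the effective kernel of an atrous conv is d*(k-1)+1."""
    rf, jump = 1, 1
    for k, s, d in layers:
        k_eff = d * (k - 1) + 1
        rf += (k_eff - 1) * jump
        jump *= s
    return rf

# Three plain 3x3 convs vs. three atrous 3x3 convs with dilation 2
# (as in the bridge stage): same parameter count, larger receptive field.
plain = receptive_field([(3, 1, 1)] * 3)    # 7 x 7
atrous = receptive_field([(3, 1, 2)] * 3)   # 13 x 13
```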
In the decoder part, each decoder stage corresponds to an encoder stage. The feature fusion in the decoder can be written as

g_{i+1} = Up_{×2}(h_{i+1}),  (1)

h_i = conv_{3×3}([g_{i+1}; f_i]),  (2)

where g_i is the merge base, f_i is the feature map to be fused, h_i is the merged feature map, conv_{3×3} is a convolution layer followed by batch normalization and ReLU activation, Up_{×2} denotes bilinear up-sampling, RRM represents the Residual Refinement Module, and the operator [·; ·] represents concatenation along the channel axis. There are three convolution layers with batch normalization and a ReLU activation function in each decoder stage. The feature map from the last stage is first fed to an up-sampling layer to obtain g_{i+1} and then concatenated with the current feature map f_i. In order to alleviate overfitting, the output of the bridge stage and of each decoder stage is fed to a 3 × 3 convolution layer followed by a bilinear up-sampling layer and a sigmoid activation function to generate a side prediction map, which is supervised by the ground truth.

Residual Refinement Module
The residual refinement module learns the residuals between the coarse maps and the ground truth and adds them to the coarse maps to produce the final results. By fine-tuning the prediction results, fuzzy and noisy boundaries can be made sharper, and the probability gap between water and non-water pixels can be increased. Compared with the encoder-decoder network, the residual refinement module has a simpler architecture, containing an input layer, a four-stage encoder-decoder with a bridge, and an output layer. Each stage has only one convolution layer, followed by batch normalization and a ReLU activation function; the convolution layer has 64 filters of size 3 × 3 with stride 1. Down-sampling and up-sampling are performed through non-overlapping 2 × 2 max pooling and bilinear interpolation, respectively.
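The additive refinement idea can be sketched in a few lines. This is a simplification: the real module is a small CNN that predicts the residual, and the network combines the maps before the final activation rather than clipping.

```python
import numpy as np

def residual_refine(coarse, residual):
    """Additive refinement: a predicted residual is added to the coarse
    probability map; here we clip to [0, 1] for illustration, whereas the
    real network squashes the sum with a sigmoid."""
    return np.clip(coarse + residual, 0.0, 1.0)
```

For example, a residual of +0.5 on a hesitant water pixel of 0.4 pushes it to 0.9, increasing the probability gap between water and non-water pixels.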

Hybrid Loss
Training loss is defined as the summation of the losses of all outputs:

L = ∑_{k=1}^{K} ℓ^{(k)},

where ℓ^{(k)} is the loss of the k-th output. Here, K = 8, including seven side outputs from the bridge stage and the decoder and one final output from the refinement module. Each loss comprises three parts: focal loss [49], Structural SIMilarity (SSIM) loss [50], and Intersection over Union (IoU) loss [51]:

ℓ^{(k)} = ℓ^{(k)}_focal + ℓ^{(k)}_ssim + ℓ^{(k)}_iou.

We replace the Binary Cross Entropy (BCE) loss in Qin et al. (2019) [43] with focal loss, defined as

ℓ_focal = −α_t (1 − p_t)^γ log(p_t), with p_t = p if y = 1 and p_t = 1 − p otherwise, and α_t = α if y = 1 and α_t = 1 − α otherwise,

where y specifies the ground-truth class and p is the model's estimated probability for the water class. The focal loss is designed to address the extreme imbalance between water and non-water classes during training. On the one hand, by using a weighting factor α ∈ [0, 1] for water and 1 − α for non-water to balance the importance of water/non-water pixels, focal loss prevents non-water pixels from dominating the gradient; a larger α puts more weight on water pixels. On the other hand, with the modulating factor γ, focal loss reduces the loss contribution from easy examples and thus focuses training on hard non-water examples (e.g., boundary pixels). Focal loss is a pixel-level measure and maintains a smooth gradient for all pixels.
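A minimal NumPy sketch of the focal loss described above, using the paper's reported hyperparameters (α = 0.25, γ = 2.5); this is an illustrative re-implementation, not the authors' code.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.5, eps=1e-7):
    """Pixel-wise focal loss for binary water/non-water segmentation.
    p: predicted water probability, y: ground truth (1 = water, 0 = non-water).
    alpha weights the water class; gamma down-weights easy pixels."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```

The modulating term (1 − p_t)^γ makes confidently correct pixels contribute almost nothing, so training effort concentrates on hard pixels.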
Taking each pixel's local neighborhood into account, SSIM loss is a patch-level measure developed to capture the structural information in an image. It gives a higher loss around the boundary even when the predicted probabilities on the boundary and the inner pixels are the same; thus, it drives the model to focus training on the boundary pixels, which are usually harder to classify. For two corresponding patches x and y from the prediction and the ground truth, it is defined as

ℓ_ssim = 1 − (2 μ_x μ_y + C_1)(2 σ_xy + C_2) / ((μ_x² + μ_y² + C_1)(σ_x² + σ_y² + C_2)),

where μ_x, μ_y and σ_x, σ_y are the means and standard deviations of patches x and y, respectively, σ_xy is their covariance, and C_1 and C_2 are constants used to avoid zero denominators. IoU loss is a map-level measure. It puts more focus on the water body, whose error rate is usually higher than that of non-water areas. For water pixels, a larger p_{r,c} stands for higher confidence of the network prediction: the higher the model's confidence in the water prediction, the lower the loss. It is defined as

ℓ_iou = 1 − ∑_{r=1}^{H} ∑_{c=1}^{W} p_{r,c} g_{r,c} / ∑_{r=1}^{H} ∑_{c=1}^{W} (p_{r,c} + g_{r,c} − p_{r,c} g_{r,c}),

where H and W are the height and width of the image, respectively, p_{r,c} is the predicted water probability at pixel (r, c), and g_{r,c} ∈ {0, 1} is the corresponding ground-truth label.
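The SSIM and IoU terms can likewise be sketched in NumPy. For brevity this computes SSIM over a single whole patch; the full formulation slides a window over the image.

```python
import numpy as np

def ssim_loss(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """Patch-level SSIM loss, 1 - SSIM(x, y), for two same-size patches."""
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    sxy = ((x - mx) * (y - my)).mean()        # covariance
    ssim = ((2 * mx * my + C1) * (2 * sxy + C2)) / (
        (mx ** 2 + my ** 2 + C1) * (sx ** 2 + sy ** 2 + C2))
    return 1.0 - ssim

def iou_loss(p, g, eps=1e-7):
    """Map-level IoU loss: 1 - soft intersection over union."""
    inter = (p * g).sum()
    union = (p + g - p * g).sum()
    return 1.0 - inter / (union + eps)
```

A perfect prediction drives both losses to zero, while a map that labels exactly the wrong pixels as water drives the IoU loss to one.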

Experimental Analysis
As introduced in the data section, we first use the 4160 automatically annotated images to pre-train our network and then use 252 manually annotated images to fine-tune it. We monitor convergence and overfitting during training on the validation set while evaluating the model performance on the test set and Bolivia test set.

Implementation Detail and Experimental Setup
The backbone used in BASNet is ResNet-34 [46] for all the experiments, pre-trained on ImageNet [52]. Other convolution layers are initialized by Xavier initialization [53]. For the hyperparameters introduced by focal loss, γ is set to 2.5 and α is set to 0.25. We utilize the Adam optimizer [54] to train our network, with all hyperparameters set to their defaults, namely lr = 0.001, betas = (0.9, 0.999), eps = 1 × 10⁻⁸, and weight decay = 0. A "poly" learning rate policy is used to adjust the learning rate; that is, the learning rate is multiplied by (1 − step/max_step)^power with power = 0.9. The batch size during training is 8. We apply data augmentation to enhance the generalization ability of our model: random horizontal and vertical flips and random rotations of 45k degrees (k = 1, 2, 3, 4) are performed on each image, each with a probability of 0.5. As additional preprocessing, we randomly crop each image to a fixed size of 256 × 256, and pixel values are normalized to the range [0, 1].
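The "poly" schedule is simple enough to state directly (an illustrative helper, not the training code):

```python
def poly_lr(base_lr, step, max_step, power=0.9):
    """'Poly' policy: the learning rate decays from base_lr to zero as
    base_lr * (1 - step / max_step) ** power."""
    return base_lr * (1.0 - step / max_step) ** power
```

With power = 0.9 the decay is slightly slower than linear early in training and accelerates toward the end.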
We implement our network with the PyTorch deep learning framework. Both training and testing are conducted on the Public Computing Cloud of Renmin University of China, using a single NVIDIA Titan RTX GPU and an Intel Xeon Silver 4114 @ 2.20 GHz CPU.

Evaluation Metrics
We use five measures to evaluate our method: Intersection over Union (IoU), mean Intersection over Union (mIoU, with equal weighting of all tiles), Overall Accuracy (OA), omission error rate (OMISSION), and commission error rate (COMMISSION) [55]. IoU, mIoU, and OA are standard evaluation metrics for image segmentation tasks. Omission and commission error rates are reported for comparison with the remote sensing literature [5,56]. The omission rate is the false negative water detection rate, and the commission rate is the false positive water detection rate; larger values indicate worse performance. Following common practice, we use mIoU as the primary metric. We calculate all the above metrics for surface water, permanent water, and temporary water segmentation. The definitions are as follows:

IoU = ∑_i TP_i / ∑_i (TP_i + FP_i + FN_i),
mIoU = (1/N) ∑_i TP_i / (TP_i + FP_i + FN_i),
OA = ∑_i (TP_i + TN_i) / ∑_i (TP_i + TN_i + FP_i + FN_i),
OMISSION = ∑_i FN_i / ∑_i (TP_i + FN_i),
COMMISSION = ∑_i FP_i / ∑_i (TP_i + FP_i),

where i is the index of the image and N is the number of images. For the i-th sample, TP_i, TN_i, FP_i, and FN_i are the numbers of correctly classified water pixels, correctly classified non-water pixels, misclassified non-water pixels, and misclassified water pixels, respectively.
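Under one plausible reading of these definitions (IoU pooled over all tiles, mIoU averaged per tile), the metrics can be computed as follows; the `tile_metrics` helper and the `ignore`-pixel convention are our own illustrative choices, mirroring the masked labels in the dataset.

```python
import numpy as np

def tile_metrics(pred, truth, ignore=-1):
    """Per-tile confusion counts for binary water (1) / non-water (0) masks;
    pixels equal to `ignore` are excluded from the evaluation."""
    valid = truth != ignore
    p, t = pred[valid], truth[valid]
    tp = np.sum((p == 1) & (t == 1))
    tn = np.sum((p == 0) & (t == 0))
    fp = np.sum((p == 1) & (t == 0))
    fn = np.sum((p == 0) & (t == 1))
    return tp, tn, fp, fn

def evaluate(preds, truths):
    """Return IoU (pooled), mIoU (equal weight per tile), OA, omission
    rate, and commission rate over a list of tiles."""
    counts = np.array([tile_metrics(p, t) for p, t in zip(preds, truths)])
    tp, tn, fp, fn = counts.sum(axis=0)
    iou = tp / (tp + fp + fn)
    miou = np.mean([c[0] / (c[0] + c[2] + c[3]) for c in counts])
    oa = (tp + tn) / (tp + tn + fp + fn)
    omission = fn / (tp + fn)      # missed water
    commission = fp / (tp + fp)    # false water
    return iou, miou, oa, omission, commission
```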

Ablation Study
In this section, we conduct experiments to verify the effectiveness of each improvement in our model. The ablation study contains three parts: data augmentation ablation, loss ablation, and image fusion ablation. The baseline is the original BASNet network with only the Sentinel-1 SAR image as input. Tables 3 and 4 summarize the results on the test set and the Bolivia test set, respectively.

Data Augmentation
As can be seen from Table 3, after applying data augmentation, the mIoU, IoU, and OA of the surface water segmentation task on the test set increase by 0.97%, 7.19%, and 4.55%, respectively. Those of the temporary water segmentation task increase by 2.12%, 1.76%, and 2.08%, respectively, while performance decreases slightly on the permanent water detection task. On the Bolivia test set, data augmentation improves mIoU, IoU, and OA on all three tasks (Table 4).

Loss Function
As shown in Table 5, the number of non-water pixels is far greater than that of water pixels on all tasks. Specifically, the number of non-water pixels is 8.43, 17.49, and 11.81 times that of surface, permanent, and temporary water pixels, respectively. We can see from Tables 3 and 4 that, on the test set, focal loss outperforms cross-entropy loss with respect to mIoU, IoU, and OA on all tasks, especially the permanent water mapping task. For permanent water extraction, focal loss brings improvements in mIoU of 25.64%, IoU of 12.74%, and OA of 5.32% (Table 3). On the Bolivia test set, focal loss significantly improves permanent water detection, with gains in mIoU of 26.87%, IoU of 36.02%, and OA of 4.90% (Table 4). Meanwhile, it achieves the best performance on the test set's surface and temporary water detection tasks and on the Bolivia test set's surface water detection task, and competitive performance in temporary water detection on the Bolivia test set. Table 5. Water, Non-Water, and Ignored area proportions at the pixel level for all water (AW), permanent water (PW), and temporary water (TW).

Comparison with other loss functions: Besides focal loss, distributional ranking (DR) loss [57] and normalized focal loss [58] have also been proposed to address the class imbalance problem.
DR loss [57] treats the classification problem as a ranking problem and improves object detection by ranking the distributions of positive and negative examples in the worst-case scenario. As a result, this loss can handle the class imbalance problem and the imbalanced hardness of negative examples while maintaining efficiency; in addition, it can separate the foreground (water) and background (non-water) with a large margin. In the DR loss, j+ and j− denote the water and non-water pixels, respectively; q+ ∈ ∆ and q− ∈ ∆ denote the distributions over water and non-water pixels, respectively; P+ and P− represent the expected scores under the corresponding distributions; and ∆ = {q : ∑_j q_j = 1, ∀j, q_j ≥ 0}. Zheng et al. [58] modified focal loss for balanced optimization. They adjust the loss distribution without changing its sum to avoid gradient vanishing, introducing a normalization constant Z that guarantees ∑_j (1/Z)(1 − p_j)^γ ℓ(p_j, y_j) = ∑_j ℓ(p_j, y_j), where ℓ(p_j, y_j) denotes the j-th pixel's cross-entropy loss, p_j represents the j-th pixel's predicted probability, and y_j is its ground truth. Hence, each pixel's loss receives a new weight (1/Z)(1 − p_j)^γ. We carry out experiments comparing these loss functions with focal loss, using DR loss with λ+ = 1, λ− = 1/log(3.5), L = 6, τ = 4 and setting γ = 2.5, α = 0.25 in both focal loss and normalized focal loss. Table 5 shows that the permanent water detection task suffers the most from the class imbalance problem. Tables 6 and 7 compare the mIoU of the three loss functions on the three tasks. Focal loss obtains slightly inferior results in surface and temporary water detection on the test set; still, on both the test and Bolivia test sets, focal loss produces the highest mIoU in permanent water detection.
With respect to normalized focal loss, focal loss (on the test set and the Bolivia test set) brings 2.74% and 3.16% performance gains in mIoU, respectively. For DR loss, focal loss gains 1.69% and 10.83% in mIoU, respectively. In addition, focal loss outperforms normalized focal loss and DR loss on the Bolivia test set across all other tasks.
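Our reading of the normalization in [58], rescaling the focal modulation so that the total loss sum is unchanged, can be sketched as follows (illustrative, with the constant Z computed per batch):

```python
import numpy as np

def normalized_focal_weights(p, y, gamma=2.5, eps=1e-7):
    """Per-pixel weights (1/Z)(1 - p_t)^gamma of normalized focal loss,
    with Z chosen so the weighted cross-entropy sum equals the unweighted
    sum -- our reading of the normalization in [58]."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)
    ce = -np.log(p_t)                 # per-pixel cross-entropy loss
    w = (1 - p_t) ** gamma            # focal modulation
    Z = (w * ce).sum() / ce.sum()     # normalization constant
    return w / Z
```

The weights still favor hard pixels, but the overall loss magnitude, and hence the gradient scale, is preserved.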

Image Fusion
After fusing Sentinel-2 optical imagery and Sentinel-1 SAR imagery, results improve significantly on all tasks, which demonstrates that optical imagery can provide useful supplementary information on water segmentation.

Evaluation on the Sen1Floods11 Test Set
To evaluate our method, we conduct comprehensive experiments on the Sen1Floods11 dataset. On the one hand, since Sen1Floods11 is a newly released dataset, few existing methods have been evaluated on it; on the other hand, existing methods use different evaluation metrics, so their reported results cannot be compared directly. Therefore, we reproduce classical remote sensing methods and several CNN (convolutional neural network)-based methods, from classical to state-of-the-art, to compare with our model under uniform experimental conditions and with the same evaluation code. These methods include the Otsu thresholding method based on the VH band [5], FCN-ResNet50 [5], Deeplab v3+ [59], and U²-Net [60].
The Otsu thresholding method [5,15] is widely used in water body extraction. It converts a grayscale image into a binary image using the threshold that best distinguishes the two types of pixels: the between-class variance is calculated according to Otsu's algorithm [15], and the threshold corresponding to the largest between-class variance is selected as the best threshold. This method is unsupervised, simple, and fast. ResNet [46] is utilized as a standard backbone in most networks; Bonafilia et al. [5] use a fully convolutional neural network (FCNN) with a ResNet50 backbone to map floods, and we compare our method with it. Our tasks can also be regarded as segmentation tasks. Chen et al. [48] proposed Deeplab for semantic segmentation; here, we apply its latest version (Deeplab v3+ [59]) to flood mapping and compare it with our model. We replace its Aligned Xception with ResNet-50 [46] to decrease the number of parameters and the computational complexity, and, to account for the relatively small batch size, we convert all batch normalization layers to group normalization layers. Considering water bodies as salient objects, our problems can also be addressed with SOD models. U²-Net [60] is a SOD network with two main advantages over previous architectures: first, it allows training from scratch rather than from existing pre-trained backbones, which avoids the problem of distributional differences between RGB images and satellite imagery; second, it achieves a deeper architecture while maintaining high-resolution feature maps at low memory and computation cost.
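For reference, Otsu's threshold selection can be implemented in a few lines of NumPy (an illustrative histogram-based version, not the exact code used in [5]):

```python
import numpy as np

def otsu_threshold(values, nbins=256):
    """Otsu's method: pick the threshold that maximises the between-class
    variance sigma_B^2 = (mu_T * w0 - mu)^2 / (w0 * (1 - w0))."""
    hist, edges = np.histogram(values, bins=nbins)
    p = hist.astype(np.float64) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(p)              # weight of the class below each threshold
    mu = np.cumsum(p * centers)    # cumulative mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu[-1] * w0 - mu) ** 2 / (w0 * (1.0 - w0))
    sigma_b[~np.isfinite(sigma_b)] = 0.0   # empty-class thresholds score zero
    return centers[np.argmax(sigma_b)]
```

Applied to the VH backscatter band, pixels below the returned threshold would be classified as water, which is why the method degrades when the two classes' intensity distributions overlap.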
Quantitative comparison: We train and test all the models on the same dataset and evaluate all predicted maps with the same code for a fair comparison. Table 8 summarizes the mIoU, IoU, OMISSION, COMMISSION, and OA of all the methods on the Sen1Floods11 test set; the best result under each metric is underlined. As can be seen, on the test set the proposed method outperforms the other methods by a large margin (over 18%) in mIoU for surface water, permanent water, and temporary water segmentation. In terms of COMMISSION and OA, our model achieves the best result on all tasks. For IoU, the proposed method brings large improvements (over 8%) on most tasks, except that Otsu and FCN-ResNet50 are superior in permanent water segmentation on the test set; there is still room for improvement here. Qualitative comparison: Figure 6 depicts the flood maps of each model (binary maps for the Otsu method; probability maps for FCN-ResNet50, Deeplab, U2-Net, and our model) on samples from the Sen1Floods11 test set. The Otsu method misses many details, such as small river tributaries and fragmented land in the middle of rivers. FCN-ResNet50, Deeplab, and U2-Net, besides missing details, produce large gray areas in their probability maps, indicating that these CNN-based models yield only low-confidence predictions and blurred boundaries. In contrast, our method produces both clear boundaries and sharp-contrast maps, and even in urban areas it produces accurate maps. Compared with the other models, the proposed method produces clearer and more accurate prediction maps.
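The five metrics reported in Tables 8 and 9 can all be derived from the binary confusion matrix of a predicted water mask against the label. The sketch below assumes the standard remote sensing definitions (OMISSION as the fraction of water pixels missed, COMMISSION as the fraction of predicted water pixels that are wrong); the paper's exact evaluation code may differ in detail.

```python
import numpy as np

def flood_metrics(pred, label):
    """Compute mIoU, IoU, OMISSION, COMMISSION, and OA for a binary
    water mask. Assumed definitions: OMISSION = FN / (TP + FN),
    COMMISSION = FP / (TP + FP)."""
    pred = np.asarray(pred).astype(bool)
    label = np.asarray(label).astype(bool)
    tp = np.sum(pred & label)     # water correctly detected
    fp = np.sum(pred & ~label)    # non-water predicted as water
    fn = np.sum(~pred & label)    # water missed
    tn = np.sum(~pred & ~label)   # non-water correctly rejected
    iou_water = tp / (tp + fp + fn)
    iou_background = tn / (tn + fp + fn)
    return {
        "IoU": iou_water,                          # water-class IoU
        "mIoU": (iou_water + iou_background) / 2,  # mean over both classes
        "OMISSION": fn / (tp + fn),
        "COMMISSION": fp / (tp + fp),
        "OA": (tp + tn) / (tp + tn + fp + fn),
    }
```

This two-class averaging explains why OA can stay above 90% while mIoU is much lower: abundant non-water pixels dominate OA, whereas mIoU weights the scarce water class equally.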

Evaluation of the New Scenario: Bolivia Flood Datasets
Quantitative comparison: Table 9 summarizes the mIoU, IoU, OMISSION, COMMISSION, and OA of all the methods; the best result under each metric is underlined. As can be seen, on the Bolivia test set the proposed method increases mIoU by over 5% in surface water, permanent water, and temporary water segmentation. Our model achieves the best COMMISSION and OA on all tasks. For IoU, the proposed method improves over 7% on all tasks, and it also performs well in terms of OMISSION. Qualitative comparison: Figure 7 depicts the flood maps of each model (binary maps for the Otsu method; probability maps for FCN-ResNet50, Deeplab, U2-Net, and our model) on samples from the Sen1Floods11 Bolivia test set. In challenging scenes, such as low-contrast foreground and cloud-occluded areas, our method still obtains robust results.

Discussion
Data augmentation can improve the generalization ability of the model, especially when training data are limited. Since the Bolivia test set consists of images from a completely unseen flood event, the gains indicate that the model's generalization ability has improved. Focal loss [49] is designed to deal with the extreme imbalance between water/non-water and difficult/easy pixels during training, and the ablation results confirm its effectiveness for this sample imbalance problem. Optical imagery contains information on the ground surface's multispectral reflectivity, which is widely used in water indices and thresholding methods. Image fusion aims to use optical image data to assist SAR image prediction, and our experimental results demonstrate that optical imagery provides useful supplementary information for water segmentation.
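The down-weighting mechanism of focal loss can be made concrete with a small per-pixel sketch. This follows the binary formulation of Lin et al. [49]; the default gamma and alpha below are the original paper's values, not necessarily the ones tuned in this study.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Per-pixel binary focal loss (Lin et al.).
    p: predicted water probability, y: label in {0, 1}.
    The (1 - pt)^gamma factor shrinks the loss of easy, well-classified
    pixels so training focuses on hard water/non-water boundary pixels."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    pt = np.where(y == 1, p, 1 - p)           # probability of the true class
    at = np.where(y == 1, alpha, 1 - alpha)   # class-balancing weight
    return -at * (1 - pt) ** gamma * np.log(pt)

# An easy water pixel (p = 0.9) contributes far less loss than a hard one (p = 0.1)
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

With gamma = 0 and alpha = 0.5, the expression reduces to half the ordinary binary cross-entropy, which makes the two extra hyperparameters easy to ablate.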
However, the all-water and temporary-water tasks have poor mIoU scores, for two reasons. The first lies in the training process: the all-water and temporary-water images contain more water pixels and fewer non-water pixels, and this difference in sample size leads to differences in the learning effect. From OMISSION and COMMISSION, we can see that the all-water and temporary-water tasks perform better than permanent water on water pixels but worse on non-water pixels. Overall, non-water pixels dominate our data, so the poorer prediction of non-water pixels leads to worse overall results. The second reason is the difference in image characteristics: the all-water and temporary-water images contain more small tributaries and scattered, newly flooded areas, which are usually more challenging to identify.
With the help of the hybrid loss, our model pays more attention to boundary pixels and increases the confidence of its predictions. As a result, our method not only produces richer details and sharper boundaries but also distinguishes water and non-water pixels with a larger probability gap. The strong feature extraction ability of the deep learning model enables our model to handle challenging scenes.
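The confidence-sharpening effect of the hybrid loss can be illustrated with its map-level component. BASNet's original hybrid loss combines pixel-level BCE, patch-level SSIM, and map-level IoU terms; the sketch below keeps only the BCE and soft-IoU terms for brevity (the SSIM term, which most directly targets boundary structure, is omitted), so it is an illustration rather than the exact training loss.

```python
import numpy as np

def bce_loss(p, y, eps=1e-7):
    """Pixel-level binary cross-entropy (mean over pixels)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def iou_loss(p, y, eps=1e-7):
    """Map-level soft IoU loss: only near-1 predictions on water and near-0
    predictions elsewhere drive it toward zero, penalizing gray, low-confidence
    probability maps."""
    inter = float((p * y).sum())
    union = float(p.sum() + y.sum() - inter)
    return 1.0 - inter / (union + eps)

def hybrid_loss(p, y):
    """Simplified BASNet-style hybrid loss (SSIM term omitted)."""
    return bce_loss(p, y) + iou_loss(p, y)

y = np.array([1.0, 0.0, 1.0])
confident = hybrid_loss(np.array([0.9, 0.1, 0.9]), y)  # sharp prediction
uncertain = hybrid_loss(np.array([0.6, 0.4, 0.6]), y)  # gray prediction
```

Both components are minimized only by high-contrast maps, which is consistent with the sharper probability maps observed in Figures 6 and 7.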

Conclusions
In this paper, we developed an efficient model for detecting permanent water and temporary water in flood disasters by fusing Sentinel-1 and Sentinel-2 imagery using a deep learning algorithm, demonstrated on the benchmark Sen1Floods11 datasets. The BASNet network adopted in this study can capture both large-scale and detailed structural features. Combined with focal loss, our model achieved state-of-the-art accuracy in identifying hard boundary pixels. The model's performance was further improved by fusing multi-source information, and the ablation study verified the effectiveness of each improvement measure. The comparison experiments demonstrated that the implemented method detects permanent water and temporary flood water more accurately than other methods. The proposed model also performed well on the unseen Bolivia test set, which verifies its robustness. Owing to the modularity of the network architecture, it can easily be adapted to data from other sensors. Finally, the method requires no prior knowledge, additional data pre-processing, or multi-temporal data, which significantly reduces its complexity and increases the degree of automation.
Ongoing and future work will focus on training water segmentation models on high-spatial-resolution remote sensing imagery, which has more complex background information, objects with larger scale variation, and more unbalanced pixel classes [58]. More sophisticated modules are required to extract and fuse richer image information. In addition, existing pre-trained neural networks are all based on RGB images, and applying them directly to remote sensing images may reduce the efficiency of transfer learning due to differences in data distribution. McKay et al. [61] dealt with this problem by discarding deep feature layers; Qin et al. [60] designed a network that allows training from scratch, but this lighter network may degrade performance. Although we dramatically improved the results of flood mapping, there is still much work to do.

Acknowledgments: This work was supported by the Public Computing Cloud, Renmin University of China. We also thank the Core Research Cluster of Disaster Science at Tohoku University (a Designated National University) for their support. We thank the reviewers for their helpful and constructive comments on our work. The author gratefully acknowledges the support of the K.C. Wong Education Foundation, Hong Kong.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: