DUPnet: Water Body Segmentation with Dense Block and Multi-Scale Spatial Pyramid Pooling for Remote Sensing Images

Water body segmentation is an important tool for the hydrological monitoring of the Earth. With the rapid development of convolutional neural networks, semantic segmentation techniques have been used on remote sensing images to extract water bodies. However, some difficulties need to be overcome to achieve good results in water body segmentation, such as complex background, huge scale, water connectivity, and rough edges. In this study, a water body segmentation model (DUPnet) with dense connectivity and multi-scale pyramidal pools is proposed to rapidly and accurately extract water bodies from Gaofen satellite and Landsat 8 OLI (Operational Land Imager) images. The proposed method includes three parts: (1) a multi-scale spatial pyramid pooling module (MSPP) is introduced to combine shallow and deep features for small water bodies and to compensate for the feature loss caused by the sampling process; (2) dense blocks are used to extract more spatial features to DUPnet's backbone, increasing feature propagation and reuse; (3) a regression loss function is proposed to train the network to deal with the unbalanced dataset caused by small water bodies. The experimental results show that the F1, MIoU, and FWIoU of DUPnet on the 2020 Gaofen dataset are 97.67%, 88.17%, and 93.52%, respectively, and on the Landsat River dataset, they are 96.52%, 84.72%, and 91.77%, respectively.


Introduction
Due to the advantages of large coverage, low cost, and short data acquisition period, remote sensing has been widely used in water body segmentation [1][2][3]. Water body segmentation is important for water resource management, ecological evaluation, and environmental protection [4,5]. The key to water body segmentation is to highlight the features of water bodies from complex backgrounds.
Traditional water body segmentation algorithms extract water bodies directly by calculating a water body index, such as the Normalized Difference Water Index (NDWI) [6] or the Modified Normalized Difference Water Index (MNDWI) [7], and then applying a corresponding threshold. These indexes exploit the differences in the reflectance of water bodies at different wavelengths to enhance the information about water bodies [8]. However, the water body index segmentation algorithm relies on thresholds that are set manually, and due to the diversity and complexity of the background, the thresholds need to be adjusted for different scenarios. Machine learning-based water body segmentation algorithms build a relationship between water body samples and masks, which reduces the reliance on segmentation thresholds. Many popular algorithms such as Support Vector Machine (SVM) [9], Random Forests (RF) [10], ...
Cross-entropy [37] is defined as a measure of the difference between two probability distributions for a random variable or event set and has been frequently used for pixel-level classification segmentation. However, cross-entropy has an obvious drawback when the image segmentation task requires only two cases to be segmented: foreground and background. When there are fewer foreground pixels, the background component of the loss function dominates, resulting in low segmentation accuracy. Furthermore, the Tversky index [38] can increase the weighting of false positives and false negatives, which effectively addresses the problem of data imbalance.
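The index-and-threshold approach described above can be sketched as follows. This is a minimal illustration, assuming the commonly used NDWI form (Green − NIR) / (Green + NIR); the function names, the sample reflectance values, and the threshold of 0.0 are hypothetical and are not taken from this study.

```python
import numpy as np

def normalized_difference(a, b, eps=1e-12):
    """Generic normalized difference (a - b) / (a + b); eps avoids division by zero."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return (a - b) / (a + b + eps)

def ndwi_water_mask(green, nir, threshold=0.0):
    """Label a pixel as water when its NDWI exceeds a manually chosen threshold."""
    ndwi = normalized_difference(green, nir)
    return ndwi > threshold

# Hypothetical reflectances: water is bright in green and dark in NIR.
green = np.array([[0.30, 0.10], [0.25, 0.08]])
nir = np.array([[0.05, 0.40], [0.10, 0.35]])
mask = ndwi_water_mask(green, nir)
```

The fixed `threshold` argument is exactly the quantity that has to be re-tuned per scene, which is the limitation that motivates learned segmentation models.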
The main contributions of this study include the following: (a) A network framework (DUPnet) is proposed by combining the MSPP and dense block to segment water bodies from remote sensing images. The DUPnet uses multi-scale spatial features and multiple levels of spectral features. The experimental results on the 2020 Gaofen challenge water body segmentation dataset and the LR dataset show that the DUPnet model outperforms the majority of state-of-the-art segmentation methods, with FWIoU values of 93.52% and 91.77% on these two datasets, respectively. (b) A water body classification dataset based on Landsat 8 OLI images is proposed, namely the LR dataset, which contains 7154 images of 128 × 128 pixels and covers an area of ~34,225 km² of the Yellow River in the Henan region, China. The LR dataset can provide the research community with a high-quality dataset for water body classification on Landsat imagery. (c) A regression loss function (Log-Cosh Tversky Loss, LCTLoss for short) was developed based on the Tversky index to address the imbalance of positive and negative samples. By modifying hyperparameters, our method is able to distinguish water bodies with substantial edge changes. The experimental results on the LR dataset show that our proposed loss function outperforms Cross-Entropy (CE) Loss, Binary Cross-Entropy (BCE) Loss, Focal Loss, Dice Loss, and Tversky Loss, demonstrating its effectiveness.

Main Network Structure
As shown in Figure 1, we propose a DUP network based on the U-Net encoder-decoder network in conjunction with the dense block (DB) and the MSPP of DeeplabV3+. Specifically, the DUP network contains three parts: the encoder, the decoder, and the skip connections. The encoder consists of convolutional layers (Conv), Batch Normalization (BN) [39], Rectified Linear Units (ReLU) [40], four dense blocks, and four down-sampling (Down) layers. The decoder uses five dense blocks and four up-sampling (Up) layers to recover the features. The skip connections use the MSPP, and the segmented image is finally output by a classification layer.
Main features of the DUP network: (1) The encoder and decoder primarily use dense blocks, which improve the network's ability to extract image semantic features and obtain highly abstracted feature maps. (2) Skip connections employ the MSPP based on Atrous Convolution to improve feature utilization and compensate for feature loss. (3) The down-sampling (Down) module uses Atrous Separable Convolution (Sep Conv) [15] instead of the maximum pooling layer to increase the receptive field of the feature map and improve the robustness of the network model.
Specifically, the DUP network uses the DB of DenseNet to establish connections between different layers, alleviate the vanishing-gradient problem, and enhance feature propagation to obtain clearer segmentation. As shown in Figure 2, for a dense block with l layers, layer l accepts the output feature maps of all preceding layers as its input. The output of layer l is denoted by x_l and is defined as:
$$x_l = H_l(\{x_0, x_1, \ldots, x_{l-1}\}) \qquad (1)$$
where H_l(·) represents the nonlinear transformation function, which is a combined operation including a series of BN, ReLU, and Conv operations.
Each DB contains four 1 × 1 convolutions, four 3 × 3 convolutions, and four feature fusions, as shown in Figure 2. The number of input feature maps can be reduced by introducing a 1 × 1 convolution before each 3 × 3 convolution. The BN and ReLU layers are added after each convolution layer of the DB [39].
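The dense connectivity rule, in which every layer receives the concatenation of all earlier feature maps, can be illustrated with a small NumPy sketch. This is not the DUPnet implementation: each H_l is reduced here to a single 1 × 1 linear map plus ReLU (BN and the 3 × 3 convolution are omitted for brevity), and the channel count of 16 and the growth rate of 8 are hypothetical values chosen only for the example.

```python
import numpy as np

def make_layer(in_channels, growth, rng):
    """Return a simplified H_l: a 1x1 linear map over channels followed by ReLU."""
    weight = rng.standard_normal((growth, in_channels)) * 0.1

    def layer(x):  # x has shape (channels, height, width)
        y = np.einsum("oc,chw->ohw", weight, x)
        return np.maximum(y, 0.0)

    return layer

def dense_block(x0, num_layers=4, growth=8, seed=0):
    """Apply x_l = H_l({x_0, ..., x_{l-1}}) and return all features concatenated."""
    rng = np.random.default_rng(seed)
    features = [x0]
    for _ in range(num_layers):
        stacked = np.concatenate(features, axis=0)  # all predecessor outputs
        layer = make_layer(stacked.shape[0], growth, rng)
        features.append(layer(stacked))
    return np.concatenate(features, axis=0)

x0 = np.ones((16, 4, 4))
out = dense_block(x0)
print(out.shape)  # (48, 4, 4): 16 input channels + 4 layers x 8 new channels
```

The input width of each successive layer grows by the growth rate, which is why a 1 × 1 bottleneck before each 3 × 3 convolution is useful for keeping the number of input feature maps manageable.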
In the BN layer, the mean and standard deviation computed for each batch are approximate estimates of the global mean and standard deviation, which introduces randomness into our search for the optimal solution and thus acts as a regularization.
An activation function is introduced into an artificial neural network to provide nonlinearity and enhance the expressive capability of the network. In this study, ReLU is chosen as the activation function. Firstly, the ReLU activation function saves a lot of computation when back-propagating the error gradient. Functions such as the Sigmoid activation function involve division, which is computationally intensive, and they are prone to vanishing gradients during back propagation. Secondly, ReLU can make the network sparse, reducing the interdependence of parameters and alleviating overfitting.
As shown in Figure 3a, the skip connections leverage the MSPP to combine shallow information from the encoding stage with deep information from the decoding stage. In the original U-Net, the skip connection performs feature fusion by cropping shallow features and concatenating them with deep features. This cropping strategy is problematic because the skip connection's output covers only a central region of the input, which reduces the accuracy of feature extraction at the image's edges. In the DeeplabV3+ network, the ASPP scheme uses multiple parallel Atrous Convolutional Layers with various dilation rates to increase the receptive field. The MSPP extracts multi-scale features from the encoder layer output and fuses them with the decoder layer output, resulting in denser multi-scale feature data. The MSPP is used as the skip connection in the DUPnet network.
Maximum pooling reduces the spatial resolution of the produced feature maps, which causes feature information loss. As a result, as shown in Figure 3b, the down-sampling module replaces the original maximum pooling layer with a 3 × 3 atrous separable convolution (Sep Conv) with a stride of 2. This convolution is a strided depthwise separable convolution, which decomposes the standard convolution into a Depthwise Convolution [41] and a Pointwise Convolution. As shown in Figure 4, the Depthwise Convolution has a broader receptive field, which can effectively address the shortcomings of maximum pooling. Depthwise convolution applies a spatial convolution to each channel independently, and pointwise convolution is used to integrate the results. Depthwise separable convolution can considerably reduce the number of parameters and the processing effort.
Atrous convolution can extract more feature responses by increasing the dilation rate, which enlarges the overlapping sampled regions on the input feature map at each sampling, whereas conventional convolution can only extract small chunks of features for a given number of parameters. Atrous separable convolution not only reduces feature loss but also allows feature extraction on feature maps of any resolution.
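The parameter saving of a depthwise separable convolution, and the halving of resolution by a stride-2 atrous convolution, can be checked with simple arithmetic. The sketch below uses the standard counting formulas (bias terms ignored); the channel counts of 64 and 128 and the "same"-style padding rule are illustrative assumptions, not values from Table 1.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k=3):
    """Depthwise k x k weights per input channel plus 1 x 1 pointwise weights."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

def strided_output_size(size, k=3, stride=2, dilation=1, padding=None):
    """Spatial output size of a (possibly dilated) strided convolution."""
    k_eff = dilation * (k - 1) + 1       # extent covered by the dilated kernel
    if padding is None:
        padding = (k_eff - 1) // 2       # assumed 'same'-style padding
    return (size + 2 * padding - k_eff) // stride + 1

print(standard_conv_params(64, 128))     # 73728
print(separable_conv_params(64, 128))    # 8768
print(strided_output_size(128, dilation=2))  # 64
```

For these hypothetical channel counts the separable form needs roughly one eighth of the weights, while the stride of 2 still halves a 128-pixel feature map to 64 pixels, as a pooling layer would.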
The up-sampling operation uses transposed convolution to increase the spatial dimensionality of the feature map. For pixel-level segmentation, the image needs to be restored to its original size, so the feature map output from the network's middle layer is mapped back to pixel space by transposed convolution (deconvolution).
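To make the up-sampling step concrete, the following is a minimal single-channel transposed convolution in NumPy. It is a sketch under simplifying assumptions (no padding, one channel, a hypothetical 2 × 2 kernel with stride 2), intended only to show how each input pixel is spread over a larger output grid.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Single-channel transposed convolution without padding.

    Each input value scatters a scaled copy of the kernel into the output,
    so the output size is (n - 1) * stride + k along each axis.
    """
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
up = transposed_conv2d(x, np.ones((2, 2)), stride=2)
print(up.shape)  # (4, 4)
```

With a kernel size equal to the stride, the scattered patches do not overlap and the spatial size exactly doubles, which mirrors how four such layers can undo four stride-2 down-sampling steps.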
The parameters of each layer in the DUP network are set as shown in Table 1.

Tversky Coefficients-Based Regression Loss Function
In the field of computer vision, the Dice coefficient is a frequently used statistic for calculating image similarity. It has also been adapted into a loss function, Dice Loss [31]. The Tversky Index (TI) [38] is an extension of the Dice and Jaccard coefficients and is defined in Equation (2). The Tversky Index reduces to the Dice coefficient when β = 0.5 and to the Jaccard coefficient when β = 1. The trade-off between false positives and false negatives can be controlled by adjusting the hyperparameter β. In Equation (3), β = 0.7. Tversky Loss (TL) is defined as follows:
$$\mathrm{TL} = 1 - \mathrm{TI} \qquad (3)$$
As the Tversky Index matches the mathematical formulation of the segmentation objective, it has been adapted for use as a loss function. However, as it is non-convex, Tversky Loss may not produce optimal outcomes. For regression-based problems, Log-Cosh has been commonly utilized to smooth the curve [28]. In terms of nonlinearity, deep learning algorithms have used hyperbolic functions, such as tanh layers, which are simple to manage and differentiate. The functional expression of cosh(x) is defined as follows:
$$\cosh(x) = \frac{e^{x} + e^{-x}}{2} \qquad (4)$$
However, the range of cosh(x) can rise to infinity. Therefore, to keep the values in a manageable range, the log function is applied, where log(·) is the logarithm with the natural number e as the base. The expression of the Log-Cosh function is defined as follows:
$$L(x) = \log(\cosh(x)) \qquad (5)$$
The derivative of L(x) with respect to x is:
$$L'(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \tanh(x) \qquad (6)$$
As the range of tanh(x) is (−1, 1), the derivative of L(x) is continuous and bounded; that is, Log-Cosh remains continuous and finite after first-order differentiation. In this study, we present the Log-Cosh Tversky Loss function (LCTLoss), which is simple to construct while retaining the properties of the Tversky coefficient and cross-entropy. LCTLoss is defined as follows:
$$\mathrm{LCTLoss}(y_t, y_p) = \log(\cosh(\mathrm{TL}(y_t, y_p))) \qquad (7)$$
where y_t is the true value of the prediction model, and y_p is the predicted value of the prediction model.
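A Log-Cosh Tversky loss of the kind described above can be sketched in NumPy. Because the exact form of the Tversky Index used in this study is not reproduced here, the sketch assumes the widely used two-weight form TI = TP / (TP + α·FP + β·FN) with α = 1 − β by default; that weighting, the `smooth` term, and the function names are assumptions made for illustration.

```python
import numpy as np

def tversky_index(y_true, y_pred, beta=0.7, alpha=None, smooth=1e-6):
    """Soft Tversky index; assumed form TP / (TP + alpha*FP + beta*FN)."""
    if alpha is None:
        alpha = 1.0 - beta               # assumption: weights sum to one
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    tp = np.sum(y_true * y_pred)
    fp = np.sum((1.0 - y_true) * y_pred)
    fn = np.sum(y_true * (1.0 - y_pred))
    return (tp + smooth) / (tp + alpha * fp + beta * fn + smooth)

def log_cosh_tversky_loss(y_true, y_pred, beta=0.7):
    """log(cosh(TL)) with TL = 1 - TI, so the loss is smooth and bounded."""
    tl = 1.0 - tversky_index(y_true, y_pred, beta=beta)
    return float(np.log(np.cosh(tl)))

perfect = log_cosh_tversky_loss([1, 1, 0, 0], [1, 1, 0, 0])
wrong = log_cosh_tversky_loss([1, 1, 0, 0], [0, 0, 1, 1])
```

Since TL lies in [0, 1], the loss ranges from 0 for a perfect mask up to about log(cosh(1)) ≈ 0.434 for a completely wrong one, and its gradient with respect to TL is tanh(TL), which stays below 1.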

Implementation Details and Evaluation Indexes
All experiments were run on an NVIDIA GeForce RTX 3070 graphics card with Python 3.8 and PyTorch 1.9.0 on Windows 10. The network model was optimized with the RMSprop [42] optimizer, which iterated until the best model was found. The weight decay and momentum parameters were set to 5 × 10⁻⁴ and 0.90, respectively [23]. The model was trained for 150 epochs with a batch size of 8 and an initial learning rate of 0.001, and the learning rate was dynamically adjusted using the poly method [43]. The evaluation metrics used in this paper include Recall, Precision, Accuracy, Mean Intersection over Union (MIoU), Frequency Weighted Intersection over Union (FWIoU), and F1. They are all derived from the confusion matrix between the water body masks and ground truths, which consists of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The positive and negative classes represent water and background, respectively.
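The metrics listed above can be computed from the four confusion-matrix counts. The sketch below uses the standard two-class definitions, which are assumed here rather than copied from this paper's equations; the function name and the example counts are hypothetical.

```python
def segmentation_metrics(tp, tn, fp, fn):
    """Two-class (water vs. background) metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / total
    f1 = 2 * precision * recall / (precision + recall)
    iou_water = tp / (tp + fp + fn)
    iou_background = tn / (tn + fp + fn)
    miou = (iou_water + iou_background) / 2
    # FWIoU weights each class IoU by the share of ground-truth pixels in that class.
    freq_water = (tp + fn) / total
    freq_background = (tn + fp) / total
    fwiou = freq_water * iou_water + freq_background * iou_background
    return {
        "recall": recall, "precision": precision, "accuracy": accuracy,
        "f1": f1, "miou": miou, "fwiou": fwiou,
    }

metrics = segmentation_metrics(tp=40, tn=50, fp=5, fn=5)
```

Accuracy is dominated by the majority class when water pixels are rare, which is why the IoU-based metrics are reported alongside it.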

Materials
To validate our approach, we conducted experiments on the 2020 Gaofen challenge water body segmentation dataset [44] and the Landsat River dataset created in this study. Figure 5 shows a sample image and mask from these datasets.


Gaofen Dataset
We chose the 2020 Gaofen Challenge Water Segmentation Dataset, which is the only high-resolution optical dataset for water body classification. The dataset contains 1000 RGB images from the GF-2 satellite with a pixel resolution of 1-4 m and an image size of 492 × 492. We expanded the dataset to 8000 images by rotating, blurring, brightening, darkening, and adding noise. The dataset was partitioned into training, validation, and test sets at a ratio of 6:2:2.

Landsat River Dataset
Based on the remote sensing images of the Yellow River in the Henan region, we created a new dataset called the Landsat River dataset (LR dataset) to further evaluate the performance of the proposed network. Images from the Landsat 8 satellite were downloaded freely from the USGS website (https://earthexplorer.usgs.gov/, accessed on 24 August 2022). The Landsat 8 remote sensing satellite carries two sensors, the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS), which provide images at 16-day intervals. Each scan generates an image with an area of ~34,225 km². The optical remote sensing data in this study are collected from the Landsat 8 OLI, and the specific information used in the experiment is shown in Table 2.


The raw remote sensing images are processed using the ENVI 5.6.1 platform (Exelis Visual Information Solutions, Boulder, CO, USA) [45]; the processing includes pre-processing, water-body masking, and dataset delineation.
The main steps of pre-processing are as follows:
(1) The multispectral band data with a resolution of 30 m and the panchromatic band data (Band 8) with a resolution of 15 m are sharpened using the Brovey Transform [46] to obtain a high-resolution remote sensing image (15 m) while preserving the multispectral information.
(2) The radiometric calibration operation is used to reduce errors generated by the sensor itself, the atmospheric correction operation is used to recover the spectral information of features, and the orthorectification operation is used to avoid geometric distortions in the image.
(3) The bands NIR (Band 5), SWIR1 (Band 6), and Red (Band 4) are selected as the red, green, and blue channels, respectively, to obtain false color composite images, in which water bodies such as rivers, lakes, and pools of various sizes are easier to identify.
These pre-processing steps are illustrated in Figure 6.
Support Vector Machine (SVM) [47] is a nonparametric statistical method used to solve supervised classification and regression problems. The underlying assumption is that separating two classes in the feature space is analogous to establishing an appropriate hyperplane, at least for linearly separable variables [48]. Determining a separating hyperplane geometrically is equivalent to identifying specific observations, known as support vectors, that best describe the problem's classes. The margin parameters define the degree of separation between the data as well as the number of support vectors required by the model.
With the advent of support vector machines [48,49], the accuracy and efficiency of remote sensing image classification have greatly improved, making SVM a competitive option for sample masking in remote sensing applications [50]. We use the SVM classifier of the ENVI 5.6.1 platform to generate a water body mask and set the kernel type of the SVM classifier to Polynomial; the other parameters follow the official ENVI documentation [51].
The main steps to create the dataset include:
(1) SVM classifier: The images described in Table 2 are selected as training and validation data. Due to the large size of these three images, we use the ENVI software platform to divide each image into four parts; one quarter of each image is used for SVM classifier training, and the remaining three quarters are used for validation. Taking the image dated 2021/03/22 as an example, the following steps are performed: (a) The ROI (Region of Interest) tool is used to construct the region of the water body (training sample). (b) The SVM classifier is then applied to extract water bodies. (c) The Interactive Class Tool is used to manually correct misclassified or omitted regions, with water body spectral curves assisting in this work, to obtain the classification results of the water bodies. (d) The water body classification results are transformed into ROIs to extract the water bodies, and steps (b) and (c) are repeated to obtain the water body extraction results for the complete remote sensing images.
(2) Image selection: The water-body masks and the corresponding remote sensing images are cropped into 128 × 128 images. In addition, the water body masks are manually corrected again in the cropped images to form the LR dataset, which contains 7154 images. This dataset was partitioned into training, validation, and test sets at a ratio of 6:2:2.
(3) Image augmentation: Data augmentation [52] is undertaken before model training and improves sample diversity and the generalization performance of the trained model. In the remote sensing image water-body dataset, augmentation is applied to all images in the training and validation sets through horizontal flipping [53], random Gaussian blurring [54], and normalization [55].
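The cropping, partitioning, and flipping steps above can be sketched with NumPy. This is an illustrative outline rather than the ENVI-based workflow used in this study: the non-overlapping tiling, the random shuffling with a fixed seed, and all function names are assumptions, and Gaussian blurring and normalization are left out for brevity.

```python
import numpy as np

def tile_image(image, tile=128):
    """Cut an array of shape (H, W, ...) into non-overlapping tile x tile patches."""
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            tiles.append(image[top:top + tile, left:left + tile])
    return tiles

def split_indices(n, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle n sample indices and split them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

def horizontal_flip(image, mask):
    """Flip an image and its mask together so the labels stay aligned."""
    return image[:, ::-1], mask[:, ::-1]

scene = np.zeros((256, 384, 3))
patches = tile_image(scene)                      # 2 rows x 3 columns of tiles
train_idx, val_idx, test_idx = split_indices(7154)
```

Applying the same geometric transform to the image and to its mask is the important detail; an augmentation that moves pixels in only one of the two would corrupt the labels.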

Quantitative Comparison of Ablation Study
Using U-Net as a baseline, an ablation experiment is performed in this work to test the effectiveness of the MSPP and the Dense Block (DB). The MSPP is introduced into U-Net as the skip connection, and the DB replaces the original convolutional layers of the U-Net structure. Firstly, to investigate the effect of different dilation rates on the MSPP, three distinct sets of dilation rates were chosen: {1, 6, 9, 12}, {1, 6, 12, 18}, and {1, 12, 24, 36}. As shown in Table 3, as the dilation rates increase, F1 and MIoU first increase and then decrease. The dilation rates {1, 6, 12, 18} resulted in the highest F1, MIoU, and FWIoU (Table 3). When the dilation rates are low, the model is less accurate in predicting difficult-to-classify pixels but more accurate in predicting easy-to-classify pixels. When the dilation rates are high, the prediction of hard-to-classify pixels may improve, but the prediction of easy-to-classify pixels may degrade. Therefore, we chose the dilation rates {1, 6, 12, 18} in this study, which make the MSPP more effective. Secondly, we verify that the MSPP and DB each contribute to the water segmentation results. As shown in Table 4, adding the MSPP module to U-Net enhances F1, MIoU, and FWIoU by 1.2%, 2.69%, and 2.79%, respectively, indicating that the MSPP is effective for water segmentation. Adding the DB module to U-Net increases the MIoU by only 0.17%, but it allows the network to extract more water body pixels, as shown in Figures 7 and 8. The DUPnet uses both the MSPP for skip connections and the DB modules; its F1, MIoU, and FWIoU increase by 1.84%, 4.96%, and 4.47%, respectively, compared to U-Net.
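One way to build intuition for the three dilation-rate sets is to compute the spatial extent covered by each dilated kernel. The sketch below uses the standard relation k_eff = k + (k − 1)(r − 1) and assumes a 3 × 3 kernel in every MSPP branch, which is an assumption for illustration; the actual branch kernels are those given in the network's parameter table.

```python
def effective_kernel_size(k, rate):
    """Extent covered by a k x k kernel with dilation `rate`."""
    return k + (k - 1) * (rate - 1)

rate_sets = [(1, 6, 9, 12), (1, 6, 12, 18), (1, 12, 24, 36)]
extents = {
    rates: [effective_kernel_size(3, r) for r in rates]
    for rates in rate_sets
}
for rates, sizes in extents.items():
    print(rates, sizes)
```

Under this assumption the largest branch of the {1, 12, 24, 36} set spans 73 pixels, more than half the width of a 128 × 128 LR tile, which is one plausible reason very large rates stop helping.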

Qualitative Comparison of Ablation Study
This study analyzes the segmentation features created independently before and after using the MSPP and DB to visually confirm the effectiveness of the proposed module. The study uses yellow boxes to highlight the locations where there are discrepancies in identification. Figures 7 and 8 depict the effect of the original U-Net, the U-Net with MSPP as skip connections, and the U-Net with DB as a convolutional layer for segmenting water bodies.
As shown in Figure 7, compared with the original U-Net, the U-Net with the MSPP and DB modules recognizes more water in the input image, extracts more features, and improves the recognition of difficult-to-extract regions such as shadows.
As shown in Figure 8, which compares results on small water bodies, the original U-Net can only track a tiny percentage of the narrow streams in the input image. The addition of the MSPP and DB modules improves the model's ability to locate water bodies and extract small water bodies, and it reduces water body misclassification and omission.

Figure panels: Image, Mask, U-Net, U-Net + MSPP, U-Net + Dense Block, DUPnet.


Quantitative Comparison with State-of-the-Art Methods
To comprehensively evaluate the segmentation performance of the improved DUPnet model, eight segmentation methods, FCN, SegNet, U-Net, ENVINet5 [56], PSPNet, DeepLabV3+, HRNet V2, and Maximum Likelihood Classification (MLC) [57], were chosen for comparison in this study, and the performance of the trained models was tested using the test set. The FCN, SegNet, U-Net, PSPNet, DeepLabV3+, HRNet V2, and DUPnet networks were trained using the network parameter settings in Section 2.1.3, with the same backbone network, ResNet50, used for FCN, PSPNet, and DeepLabV3+. In addition, the best hyper-parameters of the segmentation algorithms were used for the compared approaches.
Table 5 and Figure 9 compare the segmentation results of DUPnet and the other methods on the LR dataset. The MLC, which is based on probabilistic discrimination rules, has the highest Accuracy of 99.82%; the U-Net performs best in Recall with 97.62%; the proposed DUPnet has the highest Precision of 97.15%.
Figure 10 provides the comparison of our method with state-of-the-art methods on the 2020 Gaofen challenge water body segmentation dataset and the LR dataset in terms of F1, MIoU, and FWIoU. On both datasets, DUPnet achieves the best performance for each of the three metrics because it combines dense blocks, contextual aggregation, and multi-scale skip connections, which gives it an advantage over the other methods.
As shown in Table 7, the complexity of the compared models is evaluated by the size of the memory occupied by the model files and the average time used to predict one image.
To evaluate the segmentation ability of the compared models, ROC (Receiver Operating Characteristic) curves and P-R curves of the trained FCN, SegNet, U-Net, PSPNet, DeepLabV3+, and DUPnet models on the LR dataset are shown in Figure 11a,b. The AUC (Area Under Curve), the area under the ROC curve, lies between 0 and 1 and provides a single value for judging the quality of a classifier; a larger AUC represents a better result. In the P-R curve diagram, the closer the curve bends towards the upper right corner, i.e., (1, 1), the better the segmentation performance of the corresponding model. SegNet has an AUC of 0.87 and has the smallest area under both the ROC curve and the P-R curve. Therefore, it had the worst segmentation performance.
The proposed DUPnet has an AUC of 0.98 and contains the largest area under the ROC curve. The P-R curve of DUPnet is closer to the upper right corner compared to the other methods. Therefore, the ROC and P-R curve plots indicate that the proposed DUPnet has better model segmentation ability when compared to other methods.
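To make the ROC and P-R analysis concrete, the following sketch computes the AUC and the points of a P-R curve from per-pixel water probabilities and ground-truth labels. It uses the rank interpretation of AUC (the probability that a randomly chosen water pixel receives a higher score than a randomly chosen background pixel) and a simple pairwise loop, which is only practical for small inputs. The function names and the toy scores are illustrative assumptions, not the evaluation code used in the experiments.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: predicted water probabilities; labels: 1 for water, 0 for background.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_recall_points(scores, labels):
    """(recall, precision) pairs, one per distinct threshold, highest first."""
    total_pos = sum(labels)
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((tp / total_pos, tp / (tp + fp)))
    return points


# Toy example with six pixels (hypothetical values).
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
pr_curve = precision_recall_points(scores, labels)
```

A model whose P-R points stay near precision 1 as recall grows produces the curve that bends towards (1, 1).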

Qualitative Comparison with State-of-the-Art Methods
A qualitative comparison of the performance of the proposed method and comparable methods on the LR dataset and the 2020 Gaofen challenge water body segmentation dataset was also conducted. Figure 12 depicts the qualitative evaluation in comparison to several methods. The first through fourth columns of Figure 12 present examples of water body extraction results from the Gaofen dataset. The results show that our proposed deep segmentation network DUPnet integrates the properties of three kinds of networks, namely the encoder-decoder structure, dense connections, and Atrous Convolution, to increase the extraction accuracy for water bodies of various types (rivers, water fields, and water channels).
The images in columns 5 and 6 of Figure 12 show urban waters of various sizes and shapes from our LR dataset. FCN, PSPNet, and DeepLabV3+ produce unclear boundaries for urban waters and fail to detect small water bodies in the city, but they discriminate well between water and error-prone sub-pixels such as urban building shadows. SegNet has poor discrimination ability for the central region of water bodies. U-Net, ENVINet5, and MLC are effective in extracting urban water bodies of various sizes, but they misclassify hard-to-identify regions such as urban building shadows as water bodies.
Column 7 of Figure 12 shows an image from the LR dataset with a mountain background. FCN, PSPNet, and DeepLabV3+ segment water body pixels in the mountains with high accuracy and few errors. U-Net, SegNet, ENVINet5, and MLC easily confuse hill shadows with water bodies, so their segmentation accuracy is lower. The proposed DUPnet retains many spatial details when segmenting water bodies in the mountains: the edges are clearer and more accurate, and the easily mis-segmented pixels of hill shadows are well distinguished.
The images in columns 8 and 9 of Figure 12 show rivers of varying widths from the LR dataset. The river water bodies segmented by U-Net are discontinuous, whereas the other approaches find continuous water bodies. SegNet segments river water bodies with gaps, while ENVINet5 misclassifies roads with a similar color to water bodies as water bodies. FCN, PSPNet, DeepLabV3+, and MLC resolve river tributaries poorly. The segmentation produced by the proposed DUPnet is continuous and discriminates the hard-to-identify river tributaries. This model produces water-body segmentation images on the LR dataset with clearer and more accurate segmentation edges and retains more water-body details.
In addition, as shown in Figure 13, FCN, SegNet, PSPNet, and DeepLabV3+ distinguish building shadows best, but they recognize small water bodies less well; U-Net, ENVINet5, and MLC identify more building shadows as water bodies; our proposed method produces complete and clear boundaries and identifies more water pixels, giving the best segmentation performance on narrow streams and point water.

Comparative Analysis of Different Loss Functions
Additionally, on the LR dataset, we conduct a comparative experiment for the Cross-Entropy (CE) Loss, Binary Cross-Entropy (BCE) Loss, Focal Loss, Dice Loss, Tversky Loss, and our proposed LCTLoss to evaluate the effects of different loss functions on model performance.
We evaluated the performance of U-Net and SegNet under six different loss functions (Table 8 and Figure 14). U-Net achieved its best results with our proposed LCTLoss, where the Accuracy, Recall, F1, MIoU, and FWIoU are 90.23%, 97.62%, 92.67%, 78.37%, and 85.85%, respectively. SegNet also achieved its best results with our proposed LCTLoss, where the Accuracy, Recall, F1, and FWIoU are 90.30%, 91.96%, 93.00%, and 83.74%, respectively.
We evaluated the impact of the six loss functions on the segmentation ability of U-Net and SegNet by using ROC curves and P-R curves (Figure 15). Figure 15a shows that the U-Net model performed best with our loss function, with an AUC of 0.97 on the ROC curve, and worst with Dice Loss, with an AUC of 0.88. According to Figure 15b, the P-R curves of U-Net with our loss function and with BCE Loss are the closest to each other, but the red curve representing our loss function encloses a larger area, so our loss function is better, while the curve of Dice Loss encloses the smallest area, so its segmentation ability is inadequate.
As shown in Figure 15c,d, the effect of the six loss functions on the segmentation ability of SegNet was evaluated using the ROC curve and the P-R curve. According to Figure 15c, BCE Loss performs best on the ROC curve for the SegNet model, with an AUC of 0.88, while Tversky Loss and Focal Loss both perform poorly, with equal AUCs of 0.82. On the P-R curve in Figure 15d, SegNet has the best segmentation ability with our loss function, whose curve is closest to the top-right vertex, whereas the curve of Focal Loss is closest to the coordinate origin, indicating an undesirable segmentation effect.
Some representative images for analyzing and observing the performance of the six loss functions are shown in Figures 16 and 17, which present a qualitative comparison of U-Net and SegNet, respectively, for water body segmentation using these six loss functions. Rows 1 and 2 of Figure 16 show that U-Net identifies more water body pixels using our loss function. Rows 3 and 4 of Figure 16 show that U-Net segments the complete water bodies with clear boundaries only when using our loss function. Row 5 of Figure 16 shows that U-Net identifies more water bodies with Dice Loss and with our loss function, but the water-body boundaries are not clear. Because our proposed loss function uses a fixed hyperparameter β, which does not reach the optimal balance between positive and negative samples, it may affect some water body pixels that are hard to distinguish. In this regard, we will continue to investigate an adaptive hyperparameter for the loss function in the future.
As shown in rows 1 and 2 of Figure 17, SegNet with Focal Loss mis-segments shadows as water bodies, whereas SegNet with our loss function segments the water bodies more effectively. As shown in rows 3 and 4 of Figure 17, SegNet captures more water body pixels only when using our loss function. As shown in row 5 of Figure 17, SegNet did not identify the water bodies in the image using BCE Loss and Tversky Loss, and the other loss functions identified water body pixels with missed detections, but SegNet identified more details of water bodies using our loss function.
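The proposed loss is described in this paper as a regression loss built on the Tversky coefficient and Log-Cosh regression, with a fixed hyperparameter β. The sketch below shows one plausible form of such a loss, log(cosh(1 − TI)), where TI is the Tversky index. This exact formula, the roles assigned to alpha and beta (weights on false positives and false negatives, respectively), and their default values are assumptions made for illustration; they are not confirmed to match the LCTLoss definition used in the experiments.

```python
import math


def tversky_index(probs, targets, alpha=0.5, beta=0.5, eps=1e-7):
    """Soft Tversky index for flat lists of probabilities and 0/1 targets.

    alpha weights false positives and beta weights false negatives
    (assumed convention); alpha = beta = 0.5 reduces to the Dice coefficient.
    """
    tp = sum(p * t for p, t in zip(probs, targets))
    fp = sum(p * (1 - t) for p, t in zip(probs, targets))
    fn = sum((1 - p) * t for p, t in zip(probs, targets))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def log_cosh_tversky_loss(probs, targets, alpha=0.3, beta=0.7):
    """Assumed form of a Log-Cosh Tversky loss: log(cosh(1 - TI)).

    The loss is 0 for a perfect prediction and grows smoothly as TI drops.
    The default alpha and beta are illustrative, not the paper's values.
    """
    ti = tversky_index(probs, targets, alpha, beta)
    return math.log(math.cosh(1.0 - ti))


# With beta > alpha, missing a water pixel costs more than a false alarm.
missed_water = log_cosh_tversky_loss([1, 0, 0, 0], [1, 1, 0, 0])
false_alarm = log_cosh_tversky_loss([1, 1, 0, 0], [1, 0, 0, 0])
```

Because the balance between the two error types is set by fixed weights, a single setting cannot be optimal for every image, which is consistent with the limitation noted above for hard-to-distinguish water pixels.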

Discussion
In this study, the experimental results in Section 3.1 show that the remote sensing water segmentation provided by our proposed approach is the most effective and makes full use of multi-scale features. Dense blocks improve the use of features, and the MSPP gives the decoder additional spatial information from shallow features so that the relationship between pixels and masks can be precisely described, resulting in the best water-body segmentation. We also analyzed the water segmentation performance of DUPnet on two datasets: the LR dataset and the 2020 Gaofen challenge water body segmentation dataset. We found 16 mislabeled images in the LR dataset, which are shown on our publicly available website (https://github.com/xuemeichen99/DUPnet-Pytorch, accessed on 3 November 2022). These mislabeled data represent 0.22% of the total dataset. Most of the mislabeled images are located in mixing areas of water bodies and land, which are difficult for the human eye to distinguish. This problem also exists in the masks of the 2020 Gaofen challenge water body segmentation dataset. Furthermore, the RMSprop optimization algorithm normalizes the computed gradients at each iteration, which helps reduce the impact of mis-masked samples.
From the experimental results in Section 3.2, some representative cases demonstrate the capability of the proposed DUPnet method and other methods in identifying tributaries and watersheds at different scales. We used the MSPP as skip connections in DUPnet to collect multi-scale spatial information from remote sensing images. The MSPP extracts multi-scale features from the down-sampling output of the encoder layers, which are shallow and have high feature resolution. By fusing these multi-scale features with the decoder's high-level features, the network preserves more high-resolution detail alongside the high-level feature maps, thus improving image segmentation accuracy.
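The gradient normalization performed by RMSprop can be illustrated with a single update step. The sketch below follows the commonly used form of the algorithm, in which each gradient is divided by the square root of a running average of its squared values. The learning rate, decay factor, and epsilon are illustrative defaults and are not the training settings of this study.

```python
import math


def rmsprop_step(params, grads, sq_avg, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update on flat lists of parameters and gradients.

    sq_avg holds the running average of squared gradients per parameter.
    Returns the updated parameters and the updated running averages.
    """
    new_params, new_sq_avg = [], []
    for p, g, v in zip(params, grads, sq_avg):
        # Exponential moving average of the squared gradient.
        v = decay * v + (1.0 - decay) * g * g
        # Dividing by sqrt(v) rescales the step by the recent gradient magnitude.
        new_params.append(p - lr * g / (math.sqrt(v) + eps))
        new_sq_avg.append(v)
    return new_params, new_sq_avg


# Two parameters whose gradients differ by a factor of 100 (hypothetical values),
# e.g. one of them inflated by a mislabeled sample.
params, state = rmsprop_step([0.0, 0.0], [1.0, 100.0], [0.0, 0.0])
```

In this example both parameters move by almost the same amount on the first step, even though one gradient is 100 times larger, which is the sense in which an unusually large gradient from a mis-masked sample has a bounded effect on the update.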
The proposed DUPnet has the highest F1, MIoU, and FWIoU on the LR dataset and the 2020 Gaofen challenge water body segmentation dataset. From the results in Figures 12 and 13, DUPnet provides a superior segmentation effect and retains the most water body information, particularly when extracting water fields and narrow rivers. In addition, we proposed a loss function called LCTLoss and conducted a comparative analysis of different loss functions based on the experimental results in Section 3.3. As shown in Table 8 and Figures 14-17, the results with LCTLoss are better than those with the other loss functions.
To summarize, the DUPnet network designed in this study has the following advantages: (1) Strong capacity for feature extraction: The encoder and decoder rely heavily on DB modules, which improve the network's ability to extract semantic image features and provide highly abstracted feature maps. (2) Minor loss of feature details: The skip connection employs a multi-scale spatial pyramid pooling (MSPP) module based on Atrous Convolution to enhance the use of features and compensate for the loss of information. (3) Large receptive field of the feature maps: The down-sampling module employs depthwise separable convolution in place of the max-pooling layer to enlarge the receptive field of the feature map and enhance the robustness of the image features.
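Two generic properties underlie advantages (2) and (3): an atrous (dilated) kernel covers a wider window without adding weights, and a separable convolution needs far fewer weights than a standard one. The sketch below computes both quantities from their textbook formulas, with bias terms omitted. The dilation rates and channel sizes are hypothetical examples and are not the DUPnet configuration.

```python
def effective_kernel_size(kernel_size, dilation):
    """Window covered by a dilated kernel along one axis.

    A k-tap kernel with dilation d spans k + (k - 1) * (d - 1) pixels.
    """
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights of a standard 2D convolution (bias omitted)."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(kernel_size, in_channels, out_channels):
    """Weights of a depthwise separable 2D convolution (bias omitted).

    Depthwise part: one k x k filter per input channel.
    Pointwise part: a 1 x 1 convolution mixing the channels.
    """
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Hypothetical dilation rates for a 3 x 3 kernel.
windows = [effective_kernel_size(3, d) for d in (1, 2, 4)]

# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128)
```

Running several dilation rates in parallel is what lets a pyramid module of this kind observe the same feature map at several scales, while the separable form keeps the added parameter cost small.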

Limitations
Due to the limited spectral range of optical remote sensing images, the variety in water body shape and size, and cloud coverage, the water body masks in datasets may differ from the ground truth. Our proposed model also needs a high-quality dataset for training. However, there is a dearth of suitable datasets for supervised training in practice. Although incorporating a deep learning neural network model into the training and learning phase of remote sensing image recognition and extraction has the potential to better utilize image feature information, eliminate interference noise, and automate recognition and extraction, its accuracy is limited by the size and breadth of the training set. The model also needs further optimization so that it can be trained on more datasets.
To address some of these limitations, band combination could be used to reduce the background interference and to match the ground truth more closely. Other classifiers could also be used for dataset annotation to improve the accuracy and efficiency of dataset production. A dehazing model may also be developed to address the problem of cloud coverage in remote sensing images. Better model pruning and compression mechanisms could also be investigated to further improve the performance of the proposed DUPnet model.

Conclusions
To address the limitations of existing semantic segmentation algorithms for water bodies in remote sensing images, a water body segmentation method based on dense blocks and a multi-scale pyramid pooling module (DUPnet) is proposed. DUPnet uses dense blocks to learn and propagate features, and the encoder applies Atrous Separable Convolution (Sep Conv) down-sampling to increase the receptive field of shallow feature maps and improve the robustness of image features. The skip connections use the MSPP to extract multi-scale features from the encoder layers. In the decoder, the up-sampled features are merged with these multi-scale features, complementing the semantic and spatial information and boosting the decoder's image recovery capacity. A regression loss function based on the Tversky coefficient and Log-Cosh regression is proposed for training the deep learning model, which effectively mitigates the serious imbalance between positive and negative samples. In addition, we provide a fast method to generate datasets that can be used to train deep learning models. We selected the 5-6-4 band combination of Landsat 8 OLI images to reduce background interference. Then, we used the ENVI SVM classifier for dataset annotation, followed by two rounds of manual correction, to improve the accuracy and efficiency of dataset production. This provided good data support for extracting different types of water bodies from Landsat 8 images. This study addresses the technical problems of inefficient water body sample masking, the difficulty of extracting small water bodies, the poor flexibility of extraction methods, and the lack of precision. The superiority of the proposed method for water segmentation on the 2020 Gaofen challenge water body segmentation dataset and the LR dataset is demonstrated through ablation experiments and comparisons with comparable methods.
DUPnet has the highest Precision, F1, MIoU, and FWIoU on the LR dataset, with values of 97.15%, 96.52%, 84.72%, and 91.77%, respectively. On the 2020 Gaofen challenge water body segmentation dataset, DUPnet also has the highest F1, MIoU, and FWIoU, with 97.67%, 88.17%, and 93.52%, respectively. To facilitate communication with other researchers, the LR dataset has been made available online at https://github.com/xuemeichen99/DUPnet-Pytorch (accessed on 3 November 2022).
SVM classifiers can decrease the time and labor of annotation; however, they are generally very sensitive to the selection of kernel functions and parameter settings in remote sensing segmentation [49]. The superb fitting ability and portability of deep learning have made deep learning image algorithms highly popular. The combination of deep learning and machine learning methods will continue to be actively explored. For the dataset, we will continue to create larger, high-resolution multisource datasets and find simpler and faster ways to improve the quality of the annotations. We will explore adaptive weight assignment for the hybrid loss function. In addition, we will prune and compress the network to reduce the number of parameters as well as the processing time of the network while preserving the performance of the model.